xiangfu0 opened a new pull request, #18759: URL: https://github.com/apache/pinot/pull/18759
> **Stacked on #18754** — this branch includes that PR's commit (the FST memory/interruptibility fix) as a dependency. Review the second commit here; once #18754 merges, GitHub will narrow this diff automatically. ## Problem A `REGEXP_LIKE` backed by an **FST/IFST index** walks the query automaton over the FST. A broad pattern — anything with a leading `.*` — gives the prefix walk no pruning, so it visits a large fraction of the FST and allocates proportionally. Even after #18754 removed the retained-result overhead, a single broad query over a high-cardinality column still churns ~67 MB (FST) / ~134 MB (IFST) of transient garbage; under concurrency that drives server heap OOMs. The DFS frontier (live set) is already bounded, but **allocation throughput is not**. A benchmark comparison (default `java.util.regex` scan engine, 500K keys × 40 chars) shows that for broad patterns the dictionary scan is both faster and allocates ~190× less than the FST walk, while for selective prefix/fuzzy patterns the FST is ~1600× faster. So the right move is to **cap the FST work and fall back to scan only when the walk is actually broad**. ## Changes - **Cap the walk.** `RegexpMatcher` / `RegexpMatcherCaseInsensitive` take a `maxPaths` budget and throw `FSTTraversalLimitExceededException` once that many FST paths have been visited. - **Bounded reader API.** `TextIndexReader` gains a backward-compatible default `getDictIds(String, int maxTraversalPaths)` (the default delegates to the existing single-arg method, so other index readers are unaffected). The two Lucene readers override it and let the limit signal propagate unwrapped. - **Fall back to scan.** `PredicateEvaluatorProvider` passes the configured limit and, on `FSTTraversalLimitExceededException`, falls back to the existing dictionary-scan evaluator — correct, and typically far cheaper in memory for broad patterns. Selective patterns visit a small subtree, never trip the cap, and keep the FST's large speed advantage. - **Configurable.** New `fstRegexpTraversalLimit` query option, default **100,000** visited paths. A non-positive value disables the cap (unbounded), so a misconfigured option can't silently route every regexp to the scan fallback. No on-disk format change; the SPI addition is a default method (rolling-upgrade safe); the new query option is additive. ## Testing - **Matcher level** (`FSTBuilderTest`, `IFSTBuilderTest`): a broad walk throws at a tiny budget; a generous limit completes; the real default limit keeps a full walk on the FST path (guards against a default-shrink regression); a selective query stays under a modest budget. - **End-to-end** (`FSTBasedRegexpLikeQueriesTest`): a query with `fstRegexpTraversalLimit = 1` forces the scan fallback and returns results **identical** to the FST path (256 and 512 rows), proving fallback correctness. Reviewed independently (multi-agent): 0 critical; the two majors raised — unvalidated non-positive limit, and pinning the default — are addressed above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
