[PR] Bound FST regexp traversal and fall back to scan when the limit is exceeded [pinot]

via GitHub Sun, 14 Jun 2026 22:28:10 -0700


xiangfu0 opened a new pull request, #18759:
URL: https://github.com/apache/pinot/pull/18759


   > **Stacked on #18754** — this branch includes that PR's commit (the FST 
memory/interruptibility fix) as a dependency. Review the second commit here; 
once #18754 merges, GitHub will narrow this diff automatically.
   
   ## Problem
   
   A `REGEXP_LIKE` backed by an **FST/IFST index** walks the query automaton 
over the FST. A broad pattern — anything with a leading `.*` — gives the prefix 
walk no pruning, so it visits a large fraction of the FST and allocates 
proportionally. Even after #18754 removed the retained-result overhead, a 
single broad query over a high-cardinality column still churns ~67 MB (FST) / 
~134 MB (IFST) of transient garbage; under concurrency that drives server heap 
OOMs. The DFS frontier (live set) is already bounded, but **allocation 
throughput is not**.
   
   A benchmark comparison (default `java.util.regex` scan engine, 500K keys × 
40 chars) shows that for broad patterns the dictionary scan is both faster and 
allocates ~190× less than the FST walk, while for selective prefix/fuzzy 
patterns the FST is ~1600× faster. So the right move is to **cap the FST work 
and fall back to scan only when the walk is actually broad**.
   
   ## Changes
   
   - **Cap the walk.** `RegexpMatcher` / `RegexpMatcherCaseInsensitive` take a 
`maxPaths` budget and throw `FSTTraversalLimitExceededException` once that many 
FST paths have been visited.
   - **Bounded reader API.** `TextIndexReader` gains a backward-compatible 
default `getDictIds(String, int maxTraversalPaths)` (the default delegates to 
the existing single-arg method, so other index readers are unaffected). The two 
Lucene readers override it and let the limit signal propagate unwrapped.
   - **Fall back to scan.** `PredicateEvaluatorProvider` passes the configured 
limit and, on `FSTTraversalLimitExceededException`, falls back to the existing 
dictionary-scan evaluator — correct, and typically far cheaper in memory for 
broad patterns. Selective patterns visit a small subtree, never trip the cap, 
and keep the FST's large speed advantage.
   - **Configurable.** New `fstRegexpTraversalLimit` query option, default 
**100,000** visited paths. A non-positive value disables the cap (unbounded), 
so a misconfigured option can't silently route every regexp to the scan 
fallback.
   
   No on-disk format change; the SPI addition is a default method 
(rolling-upgrade safe); the new query option is additive.
   
   ## Testing
   
   - **Matcher level** (`FSTBuilderTest`, `IFSTBuilderTest`): a broad walk 
throws at a tiny budget; a generous limit completes; the real default limit 
keeps a full walk on the FST path (guards against a default-shrink regression); 
a selective query stays under a modest budget.
   - **End-to-end** (`FSTBasedRegexpLikeQueriesTest`): a query with 
`fstRegexpTraversalLimit = 1` forces the scan fallback and returns results 
**identical** to the FST path (256 and 512 rows), proving fallback correctness.
   
   Reviewed independently (multi-agent): 0 critical; the two majors raised — 
unvalidated non-positive limit, and pinning the default — are addressed above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Bound FST regexp traversal and fall back to scan when the limit is exceeded [pinot]

Reply via email to