deepthi912 opened a new pull request, #18599: URL: https://github.com/apache/pinot/pull/18599
Companion to apache#18590 (which reverts the matcher-level fix from #18579). This PR is the actual fix. Addresses @Jackie-Jiang's feedback on apache#18579: <https://github.com/apache/pinot/pull/18579#issuecomment-4546819452> > *"The correct fix should be within predicate evaluator to handle raw values even when dictionary exist"* ## Bug `regexp_like(col, ...)` and `LIKE` against a column with `encodingType: RAW` + dictionary + IFST/FST + no inverted/sorted index throw `UnsupportedOperationException` at scan time. Common on iceberg/external-table layouts. Stack: ``` BaseDictionaryBasedPredicateEvaluator.applySV(String):133 → throws SVScanDocIdIterator$StringMatcher.doesValueMatch:308 ``` ## Root cause `FilterPlanNode.case REGEXP_LIKE` unconditionally builds the IFST/FST dict-id evaluator whenever the corresponding text index exists. This bypasses the invariant `PredicateEvaluatorProvider.getDictionaryUsableForFiltering` already enforces for every other dict-based predicate type: ```java case REGEXP_LIKE: return invertedAvailable ? dictionary : null; // drops dict if no dict-consuming op ``` When no dict-consuming operator can be picked, `FilterOperatorUtils.getLeafFilterOperator` falls through to `ScanBasedFilterOperator`, which reads raw values and calls `applySV(String)` — a dict-id evaluator can't service that. ## Fix Reuse `PredicateEvaluatorProvider.getDictionaryUsableForFiltering` to decide whether the FST/IFST evaluator can be consumed. If yes, build it; otherwise fall through to the standard provider, which returns `RawValueBasedRegexpLikePredicateEvaluator` — already implements `applySV(String)` correctly. ```java boolean dictUsable = PredicateEvaluatorProvider.getDictionaryUsableForFiltering( dataSource, _queryContext, predicate) != null; if (caseInsensitive) { if (dataSource.getIFSTIndex() != null && dictUsable) { predicateEvaluator = IFSTBasedRegexpPredicateEvaluatorFactory.newIFSTBasedEvaluator(...); } else { predicateEvaluator = PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, _queryContext); } } else { if (dataSource.getFSTIndex() != null && dictUsable) { predicateEvaluator = FSTBasedRegexpPredicateEvaluatorFactory.newFSTBasedEvaluator(...); } else { predicateEvaluator = PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, _queryContext); } } ``` Also flips `getDictionaryUsableForFiltering` from package-private to public so `FilterPlanNode` (a different package) can call it. ## Behavior matrix | Forward index | FST/IFST | Sorted | Inverted | Evaluator chosen | Filter operator | Status | |---|---|---|---|---|---|---| | DICT | yes | no | no | FST/IFST eval | scan + `DictIdMatcher` | ✅ | | DICT | yes | no | yes | FST/IFST eval | InvertedIndex | ✅ | | DICT | yes | yes | no | FST/IFST eval | SortedIndex | ✅ | | **RAW** | yes | no | no | raw eval | scan + `StringMatcher` | ✅ (was crashing) | | RAW | yes | no | yes | FST/IFST eval | InvertedIndex | ✅ | ## Test coverage `FilterPlanNodeTest` — five new tests covering each branch of the decision matrix using Mockito-mocked `DataSource`/`IndexSegment`. All assertions verify the evaluator type chosen by the planner (dict-id-based vs raw-value-based) for the relevant index combination. ## Relationship to apache#18590 #18590 reverts the matcher-level fix from #18579. This PR adds the planner-level fix. Either order of merge works: - If #18590 lands first: master is back to the pre-#18579 crash, this PR fixes it. - If this PR lands first: the bug is fixed; #18590 then cleans up the now-dead matcher code from #18579. 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
