deepthi912 opened a new pull request, #18589: URL: https://github.com/apache/pinot/pull/18589
Follow-up to #18579 per @Jackie-Jiang's review feedback: <https://github.com/apache/pinot/pull/18579#issuecomment-4546819452> > *"This is not the correct fix. It is way too expensive to lookup every single value within forward index. The correct fix should be within predicate evaluator to handle raw values even when dictionary exist"* #18579 routed raw-value scans through a per-row `dictionary.indexOf(value)` translation in `SVScanDocIdIterator`. This PR replaces that approach with a planner-level guard that avoids constructing a dict-id-based evaluator in the first place when no dict-consuming operator is available. ## Bug Even after #18579, `regexp_like(col, ...)` and `LIKE` against a RAW-encoded string column with a separate dictionary plus FST or IFST (and no inverted/sorted index) follow a suboptimal path: the IFST/FST evaluator is constructed (wasted FST automaton work) and then the scan does a per-row dict lookup to translate every value. This is what Jackie called out. When the FST/IFST evaluator cannot be consumed by an index-based filter operator, it shouldn't be constructed at all — the existing `RawValueBasedRegexpLikePredicateEvaluator` is the right choice and already implements `applySV(String)` correctly. ## Root cause `FilterPlanNode.case REGEXP_LIKE` unconditionally builds the IFST/FST dict-id evaluator when the corresponding index exists, bypassing the invariant that `PredicateEvaluatorProvider.getDictionaryUsableForFiltering` enforces for every other dict-based predicate type: ```java case REGEXP_LIKE: return invertedAvailable ? dictionary : null; // drops dict if no dict-consuming op ``` ## Fix Add the same invariant to the IFST/FST branches in `FilterPlanNode`: ```java if (caseInsensitive) { if (dataSource.getIFSTIndex() != null && canConsumeDictIdEvaluator(dataSource, _queryContext)) { predicateEvaluator = IFSTBasedRegexpPredicateEvaluatorFactory.newIFSTBasedEvaluator(...); } else { predicateEvaluator = PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, _queryContext); } } // mirror for case-sensitive FST branch private static boolean canConsumeDictIdEvaluator(DataSource dataSource, QueryContext queryContext) { if (dataSource.getDataSourceMetadata().isSorted() && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.SORTED)) { return true; } if (dataSource.getInvertedIndex() != null && queryContext.isIndexUseAllowed(dataSource, FieldConfig.IndexType.INVERTED)) { return true; } // Scan with DictIdMatcher can consume dict-id evaluators when the forward index is dict-encoded. ForwardIndexReader<?> forwardIndex = dataSource.getForwardIndex(); return forwardIndex != null && forwardIndex.isDictionaryEncoded(); } ``` The helper covers all three runtime consumers of a dict-id evaluator: 1. `SortedIndexBasedFilterOperator` 2. `InvertedIndexFilterOperator` 3. `ScanBasedFilterOperator` when the forward index is dict-encoded (uses `DictIdMatcher`, which calls `applySV(int dictId)`) When none of these will be picked, the IFST/FST evaluator is skipped and the raw-value evaluator is used instead — `RawValueBasedRegexpLikePredicateEvaluator` already implements `applySV(String)`. ## Behavior matrix | Forward index | FST/IFST | Sorted | Inverted | Evaluator chosen | Filter operator | Status | |---|---|---|---|---|---|---| | DICT | yes | no | no | FST/IFST eval | scan + `DictIdMatcher` | ✅ | | DICT | yes | no | yes | FST/IFST eval | InvertedIndex | ✅ | | DICT | yes | yes | no | FST/IFST eval | SortedIndex | ✅ | | RAW | yes | no | no | raw eval | scan + `StringMatcher` | ✅ was crashing | | RAW | yes | no | yes | FST/IFST eval | InvertedIndex | ✅ | ## Reverts This effectively reverts the matcher-level changes from #18579 (which Jackie identified as too expensive) by addressing the same crash at a higher layer with fewer lines and no public API change. ## What's NOT changed - `SVScanDocIdIterator` — reverted to upstream/master state (removes the matchers added in #18579). - `BaseDictionaryBasedPredicateEvaluator` — untouched. - `FilterOperatorUtils.getLeafFilterOperator` — untouched. - Concrete evaluator subclasses — untouched. ## Known limitation (separate follow-up) The `ScanBasedRegexpLikePredicateEvaluator` (used when dict ≥10K entries) extends `BaseDictionaryBasedPredicateEvaluator` directly, not `BaseDictIdBasedRegexpLikePredicateEvaluator`. `FilterOperatorUtils.getLeafFilterOperator` checks for the latter only, so on **RAW forward + dict ≥10K + inverted index** (no FST/IFST), the planner builds a `ScanBased` evaluator but the operator selector falls through to scan, producing the same crash. Out of scope for this PR; will file a separate fix to broaden the `instanceof` check. ## Test plan - [ ] Existing `PredicateEvaluatorProviderTest` should pass unchanged - [ ] Manual: `regexp_like(col, 'pat')`, `regexp_like(col, 'pat', 'i')`, and `LIKE 'pat%'` all return correct results (no crash) on a table with RAW forward + dict + FST/IFST + no inverted (the iceberg/external-table scenario that triggered #18579) - [ ] With inverted index added: queries still route to `InvertedIndexFilterOperator` (fast path) 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
