deepthi912 opened a new pull request, #18589:
URL: https://github.com/apache/pinot/pull/18589

   Follow-up to #18579 per @Jackie-Jiang's review feedback: 
<https://github.com/apache/pinot/pull/18579#issuecomment-4546819452>
   
   > *"This is not the correct fix. It is way too expensive to lookup every 
single value within forward index. The correct fix should be within predicate 
evaluator to handle raw values even when dictionary exist"*
   
   #18579 routed raw-value scans through a per-row `dictionary.indexOf(value)` 
translation in `SVScanDocIdIterator`. This PR replaces that approach with a 
planner-level guard that avoids constructing a dict-id-based evaluator in the 
first place when no dict-consuming operator is available.
   
   ## Bug
   
   Even after #18579, `regexp_like(col, ...)` and `LIKE` against a RAW-encoded 
string column with a separate dictionary plus FST or IFST (and no 
inverted/sorted index) follow a suboptimal path: the IFST/FST evaluator is 
constructed (wasted FST automaton work) and then the scan does a per-row dict 
lookup to translate every value. This is what Jackie called out.
   
   When the FST/IFST evaluator cannot be consumed by an index-based filter 
operator, it shouldn't be constructed at all — the existing 
`RawValueBasedRegexpLikePredicateEvaluator` is the right choice and already 
implements `applySV(String)` correctly.
   
   ## Root cause
   
   `FilterPlanNode.case REGEXP_LIKE` unconditionally builds the IFST/FST 
dict-id evaluator when the corresponding index exists, bypassing the invariant 
that `PredicateEvaluatorProvider.getDictionaryUsableForFiltering` enforces for 
every other dict-based predicate type:
   
   ```java
   case REGEXP_LIKE:
     return invertedAvailable ? dictionary : null;   // drops dict if no 
dict-consuming op
   ```
   
   ## Fix
   
   Add the same invariant to the IFST/FST branches in `FilterPlanNode`:
   
   ```java
   if (caseInsensitive) {
     if (dataSource.getIFSTIndex() != null
         && canConsumeDictIdEvaluator(dataSource, _queryContext)) {
       predicateEvaluator = 
IFSTBasedRegexpPredicateEvaluatorFactory.newIFSTBasedEvaluator(...);
     } else {
       predicateEvaluator = 
PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, 
_queryContext);
     }
   }
   // mirror for case-sensitive FST branch
   
   private static boolean canConsumeDictIdEvaluator(DataSource dataSource, 
QueryContext queryContext) {
     if (dataSource.getDataSourceMetadata().isSorted()
         && queryContext.isIndexUseAllowed(dataSource, 
FieldConfig.IndexType.SORTED)) {
       return true;
     }
     if (dataSource.getInvertedIndex() != null
         && queryContext.isIndexUseAllowed(dataSource, 
FieldConfig.IndexType.INVERTED)) {
       return true;
     }
     // Scan with DictIdMatcher can consume dict-id evaluators when the forward 
index is dict-encoded.
     ForwardIndexReader<?> forwardIndex = dataSource.getForwardIndex();
     return forwardIndex != null && forwardIndex.isDictionaryEncoded();
   }
   ```
   
   The helper covers all three runtime consumers of a dict-id evaluator:
   1. `SortedIndexBasedFilterOperator`
   2. `InvertedIndexFilterOperator`
   3. `ScanBasedFilterOperator` when the forward index is dict-encoded (uses 
`DictIdMatcher`, which calls `applySV(int dictId)`)
   
   When none of these will be picked, the IFST/FST evaluator is skipped and the 
raw-value evaluator is used instead — 
`RawValueBasedRegexpLikePredicateEvaluator` already implements 
`applySV(String)`.
   
   ## Behavior matrix
   
   | Forward index | FST/IFST | Sorted | Inverted | Evaluator chosen | Filter 
operator | Status |
   |---|---|---|---|---|---|---|
   | DICT | yes | no | no | FST/IFST eval | scan + `DictIdMatcher` | ✅ |
   | DICT | yes | no | yes | FST/IFST eval | InvertedIndex | ✅ |
   | DICT | yes | yes | no | FST/IFST eval | SortedIndex | ✅ |
   | RAW | yes | no | no | raw eval | scan + `StringMatcher` | ✅ was crashing |
   | RAW | yes | no | yes | FST/IFST eval | InvertedIndex | ✅ |
   
   ## Reverts
   
   This effectively reverts the matcher-level changes from #18579 (which Jackie 
identified as too expensive) by addressing the same crash at a higher layer 
with fewer lines and no public API change.
   
   ## What's NOT changed
   
   - `SVScanDocIdIterator` — reverted to upstream/master state (removes the 
matchers added in #18579).
   - `BaseDictionaryBasedPredicateEvaluator` — untouched.
   - `FilterOperatorUtils.getLeafFilterOperator` — untouched.
   - Concrete evaluator subclasses — untouched.
   
   ## Known limitation (separate follow-up)
   
   The `ScanBasedRegexpLikePredicateEvaluator` (used when dict ≥10K entries) 
extends `BaseDictionaryBasedPredicateEvaluator` directly, not 
`BaseDictIdBasedRegexpLikePredicateEvaluator`. 
`FilterOperatorUtils.getLeafFilterOperator` checks for the latter only, so on 
**RAW forward + dict ≥10K + inverted index** (no FST/IFST), the planner builds 
a `ScanBased` evaluator but the operator selector falls through to scan, 
producing the same crash. Out of scope for this PR; will file a separate fix to 
broaden the `instanceof` check.
   
   ## Test plan
   
   - [ ] Existing `PredicateEvaluatorProviderTest` should pass unchanged
   - [ ] Manual: `regexp_like(col, 'pat')`, `regexp_like(col, 'pat', 'i')`, and 
`LIKE 'pat%'` all return correct results (no crash) on a table with RAW forward 
+ dict + FST/IFST + no inverted (the iceberg/external-table scenario that 
triggered #18579)
   - [ ] With inverted index added: queries still route to 
`InvertedIndexFilterOperator` (fast path)
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to