deepthi912 opened a new pull request, #18599:
URL: https://github.com/apache/pinot/pull/18599

   Companion to apache#18590 (which reverts the matcher-level fix from #18579). 
This PR is the actual fix.
   
   Addresses @Jackie-Jiang's feedback on apache#18579: 
<https://github.com/apache/pinot/pull/18579#issuecomment-4546819452>
   
   > *"The correct fix should be within predicate evaluator to handle raw 
values even when dictionary exist"*
   
   ## Bug
   
   `regexp_like(col, ...)` and `LIKE` against a column with `encodingType: RAW` 
+ dictionary + IFST/FST + no inverted/sorted index throw 
`UnsupportedOperationException` at scan time. Common on iceberg/external-table 
layouts.
   
   Stack:
   ```
   BaseDictionaryBasedPredicateEvaluator.applySV(String):133  → throws
   SVScanDocIdIterator$StringMatcher.doesValueMatch:308
   ```
   
   ## Root cause
   
   `FilterPlanNode.case REGEXP_LIKE` unconditionally builds the IFST/FST 
dict-id evaluator whenever the corresponding text index exists. This bypasses 
the invariant `PredicateEvaluatorProvider.getDictionaryUsableForFiltering` 
already enforces for every other dict-based predicate type:
   
   ```java
   case REGEXP_LIKE:
     return invertedAvailable ? dictionary : null;   // drops dict if no 
dict-consuming op
   ```
   
   When no dict-consuming operator can be picked, 
`FilterOperatorUtils.getLeafFilterOperator` falls through to 
`ScanBasedFilterOperator`, which reads raw values and calls `applySV(String)` — 
a dict-id evaluator can't service that.
   
   ## Fix
   
   Reuse `PredicateEvaluatorProvider.getDictionaryUsableForFiltering` to decide 
whether the FST/IFST evaluator can be consumed. If yes, build it; otherwise 
fall through to the standard provider, which returns 
`RawValueBasedRegexpLikePredicateEvaluator` — already implements 
`applySV(String)` correctly.
   
   ```java
   boolean dictUsable = 
PredicateEvaluatorProvider.getDictionaryUsableForFiltering(
       dataSource, _queryContext, predicate) != null;
   if (caseInsensitive) {
     if (dataSource.getIFSTIndex() != null && dictUsable) {
       predicateEvaluator = 
IFSTBasedRegexpPredicateEvaluatorFactory.newIFSTBasedEvaluator(...);
     } else {
       predicateEvaluator = 
PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, 
_queryContext);
     }
   } else {
     if (dataSource.getFSTIndex() != null && dictUsable) {
       predicateEvaluator = 
FSTBasedRegexpPredicateEvaluatorFactory.newFSTBasedEvaluator(...);
     } else {
       predicateEvaluator = 
PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource, 
_queryContext);
     }
   }
   ```
   
   Also flips `getDictionaryUsableForFiltering` from package-private to public 
so `FilterPlanNode` (a different package) can call it.
   
   ## Behavior matrix
   
   | Forward index | FST/IFST | Sorted | Inverted | Evaluator chosen | Filter 
operator | Status |
   |---|---|---|---|---|---|---|
   | DICT | yes | no | no | FST/IFST eval | scan + `DictIdMatcher` | ✅ |
   | DICT | yes | no | yes | FST/IFST eval | InvertedIndex | ✅ |
   | DICT | yes | yes | no | FST/IFST eval | SortedIndex | ✅ |
   | **RAW** | yes | no | no | raw eval | scan + `StringMatcher` | ✅ (was 
crashing) |
   | RAW | yes | no | yes | FST/IFST eval | InvertedIndex | ✅ |
   
   ## Test coverage
   
   `FilterPlanNodeTest` — five new tests covering each branch of the decision 
matrix using Mockito-mocked `DataSource`/`IndexSegment`. All assertions verify 
the evaluator type chosen by the planner (dict-id-based vs raw-value-based) for 
the relevant index combination.
   
   ## Relationship to apache#18590
   
   #18590 reverts the matcher-level fix from #18579. This PR adds the 
planner-level fix.
   
   Either order of merge works:
   - If #18590 lands first: master is back to the pre-#18579 crash, this PR 
fixes it.
   - If this PR lands first: the bug is fixed; #18590 then cleans up the 
now-dead matcher code from #18579.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to