deepthi912 commented on PR #18579:
URL: https://github.com/apache/pinot/pull/18579#issuecomment-4540507093
Confirmed on both points. Tracing concretely:
### How `_predicateEvaluator` is created
`FilterPlanNode.java:308-330` builds the dict-based evaluator before
`getLeafFilterOperator` is called:
```java
case REGEXP_LIKE:
boolean caseInsensitive = regexpLikePredicate.isCaseInsensitive();
if (caseInsensitive) {
if (dataSource.getIFSTIndex() != null) {
predicateEvaluator =
IFSTBasedRegexpPredicateEvaluatorFactory.newIFSTBasedEvaluator(
regexpLikePredicate, dataSource.getIFSTIndex(),
dataSource.getDictionary());
} else {
predicateEvaluator =
PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource,
_queryContext);
}
} else {
if (dataSource.getFSTIndex() != null) {
predicateEvaluator =
FSTBasedRegexpPredicateEvaluatorFactory.newFSTBasedEvaluator(...);
} else {
predicateEvaluator =
PredicateEvaluatorProvider.getPredicateEvaluator(predicate, dataSource,
_queryContext);
}
}
return FilterOperatorUtils.getLeafFilterOperator(_queryContext,
predicateEvaluator, dataSource, numDocs);
```
For the crash repro (case-insensitive + IFST), it produces an
`IFSTBasedRegexpPredicateEvaluator`. The class chain is:
```
IFSTBasedRegexpPredicateEvaluator
extends BaseDictIdBasedRegexpLikePredicateEvaluator
extends BaseDictionaryBasedPredicateEvaluator
```
So the `_predicateEvaluator instanceof
BaseDictionaryBasedPredicateEvaluator` guard in
`SVScanDocIdIterator.getValueMatcher()` matches.
### What `_predicateEvaluator.applySV(dictId)` does
From `IFSTBasedRegexpPredicateEvaluatorFactory.java:48-72`:
```java
private static class IFSTBasedRegexpPredicateEvaluator extends
BaseDictIdBasedRegexpLikePredicateEvaluator {
final ImmutableRoaringBitmap _matchingDictIdBitmap;
public IFSTBasedRegexpPredicateEvaluator(RegexpLikePredicate
regexpLikePredicate,
TextIndexReader ifstIndexReader, Dictionary dictionary) {
super(regexpLikePredicate, dictionary);
String searchQuery =
RegexpPatternConverterUtils.regexpLikeToLuceneRegExp(regexpLikePredicate.getValue());
_matchingDictIdBitmap = ifstIndexReader.getDictIds(searchQuery); // ←
IFST precomputes matching dict ids
...
}
@Override
public boolean applySV(int dictId) {
return _matchingDictIdBitmap.contains(dictId); // ← O(1) bitmap
membership check
}
}
```
So `applySV(int dictId)` is exactly the dict-id membership check against the
IFST-precomputed bitmap. The IFST consulted the dictionary **once** at
evaluator construction to enumerate matching dict ids; the matcher then does a
Roaring bitmap contains per row.
### Dictionary identity
The dictionary instance is the same object in both places:
- Evaluator construction: `dataSource.getDictionary()` passed to the factory
in `FilterPlanNode`.
- `DictLookupMatcher` runtime: `_dictionary = dataSource.getDictionary()`
captured in `SVScanDocIdIterator`'s constructor from the same `DataSource`.
So dict ids are consistent: a raw value `"abc"` translates to the same dict
id the IFST already bucketed at construction time. No risk of dict-id
divergence.
### Per-row flow with the patch
```
1. forwardIndex.getString(docId) → "abc" (RAW read)
2. dictionary.indexOf("abc") → 17 (separate
dict lookup)
3. _predicateEvaluator.applySV(17)
= IFSTBasedRegexpPredicateEvaluator.applySV(17)
= _matchingDictIdBitmap.contains(17) → true / false
(IFST-precomputed bitmap)
```
Step 3 is the dict-based evaluator, exclusively going through `applySV(int
dictId)`. `applySV(String)` is never called.
Same flow for `FSTBasedRegexpPredicateEvaluator` (case-sensitive FST),
`DictIdBasedRegexpLikePredicateEvaluator` (small-dict regex), and
`ScanBasedRegexpLikePredicateEvaluator` (large-dict regex) — all extend
`BaseDictionaryBasedPredicateEvaluator` and implement `applySV(int)`. The
matcher serves all of them.
The unit test in this PR (`SVScanDocIdIteratorTest`) verifies this contract
concretely: the stand-in evaluator inherits the default `applySV(String)` from
`BaseDictionaryBasedPredicateEvaluator`, which throws. If the matcher routing
ever regresses to `StringMatcher`, the test fails loudly with
`UnsupportedOperationException` — the exact production symptom.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]