xiangfu0 opened a new pull request, #18500:
URL: https://github.com/apache/pinot/pull/18500
## Summary
A column configured with `EncodingType.RAW` + `dictionaryIndex` has a
dictionary file on disk but a non-dict-encoded forward index.
`ColumnContext.fromDataSource` and `ProjectionBlockValSet.getDictionary`
exposed the dictionary unconditionally, so callers gating on `getDictionary()
!= null` then called `getDictionaryIdsSV/MV` — which routes to
`ForwardIndexReader.readDictIds` and throws `UnsupportedOperationException` on
a RAW forward index.
Reported as a multi-column GROUP BY crash on a S3 Parquet column configured
with RAW forward + dictionary:
```
Caused by: java.lang.UnsupportedOperationException
at
org.apache.pinot.segment.spi.index.reader.ForwardIndexReader.readDictIds(ForwardIndexReader.java:129)
at
org.apache.pinot.core.common.DataFetcher$ColumnValueReader.readDictIds(DataFetcher.java:353)
at
org.apache.pinot.core.common.DataFetcher.fetchDictIds(DataFetcher.java:105)
at
org.apache.pinot.core.common.DataBlockCache.getDictIdsForSVColumn(DataBlockCache.java:114)
at
org.apache.pinot.core.operator.docvalsets.ProjectionBlockValSet.getDictionaryIdsSV(ProjectionBlockValSet.java:105)
at
org.apache.pinot.core.query.aggregation.groupby.NoDictionaryMultiColumnGroupKeyGenerator.generateKeysForBlock(NoDictionaryMultiColumnGroupKeyGenerator.java:120)
```
The same buggy assumption ("dictionary exists implies dict-encoded forward
index") affects several distinct/aggregation paths beyond
`NoDictionaryMultiColumnGroupKeyGenerator`: `DistinctExecutorFactory`,
`SegmentPartitionedDistinctCount`, `AnyValue`, `ModeAggregation`, the
`DistinctCount*` sketch aggregators, and `BaseDistinctAggregate*`. Fixing the
two source methods covers all of them.
## Fix
Apply the same gating already used by `IdentifierTransformFunction` and
`DataFetcher.addDataSource`: expose the dictionary only when
`ForwardIndexReader.isDictionaryEncoded()` is true. This matches the
`BlockValSet.getDictionary()` Javadoc contract: *null if the column is not
dictionary-encoded*.
After this change, `getDictionary() != null` correctly implies
`getDictionaryIdsSV/MV()` is callable, so every caller that already gates on
this idiom is correct.
## Test plan
- [x] Added `testMultiColumnGroupByWithRawDictColumnReturnsSameResults` in
`RawForwardIndexWithDictionaryTest`, exercising the exact
`NoDictionaryMultiColumnGroupKeyGenerator` path from the stack trace.
- [x] `pinot-core` and `pinot-integration-tests` compile cleanly.
- [x] Spotless, checkstyle, and license checks pass on both modules.
- [x] Existing unit tests pass (238 tests across
`DictionaryBasedGroupKeyGeneratorTest`, `NoDictionaryGroupKeyGeneratorTest`,
`DistinctQueriesTest`, `InterSegmentAggregation{Single,Multi}ValueQueriesTest`,
`InnerSegmentAggregation{Single,Multi}ValueQueriesTest`, `RangeQueriesTest`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]