xiangfu0 opened a new pull request, #18504: URL: https://github.com/apache/pinot/pull/18504
## Summary A column declared with `EncodingType.RAW` + an explicit `dictionaryIndex` has a `Dictionary` file on disk but a RAW forward index that throws `UnsupportedOperationException` on `ForwardIndexReader#readDictIds`. Many aggregation, group-by, and distinct executors gated their dict-id read path on `blockValSet.getDictionary() != null` alone, so a single such column in a query crashes the operator. [apache#18500](https://github.com/apache/pinot/pull/18500) fixed one site (`NoDictionaryMultiColumnGroupKeyGenerator`). A wider scan surfaced ~30 more call sites with the same pattern. This PR makes the gate explicit at every site so the rule is single-sourced and consistently applied. ## Fix Introduce `boolean isDictionaryEncoded()` on `BlockValSet` and `ColumnContext` that returns `true` only when the forward index is dictionary-encoded. - `BlockValSet#isDictionaryEncoded()` — default implementation returns `getDictionary() != null`, so non-projection value sets (transform / row / data-block) keep working unchanged. - `ProjectionBlockValSet#isDictionaryEncoded()` — overrides to consult `ForwardIndexReader#isDictionaryEncoded()` directly, so `RAW + dictionaryIndex` columns correctly report `false`. - `ColumnContext#isDictionaryEncoded()` — computed at construction from the underlying `DataSource`. - `getDictionary()` keeps its straightforward "is there a dictionary file?" meaning. Filter operators that obtain the dictionary via `DataSource#getDictionary()` for value↔id lookup are unaffected. Every aggregation/distinct/group-by call site now gates on `isDictionaryEncoded()` rather than `getDictionary() != null`: | Site | Notes | |---|---| | `DefaultGroupByExecutor` | `!columnContext.isDictionaryEncoded()` routes to no-dict generator | | `NoDictionaryMultiColumnGroupKeyGenerator` | reuses the flag — same fix family as apache#18500 | | `DistinctExecutorFactory` | single- and multi-column paths | | `BaseDistinctAggregateAggregationFunction` | parent of `DISTINCTCOUNT`/`DISTINCTSUM`/`DISTINCTAVG`/MV variants | | `DistinctCountBitmap` / `HLL` / `HLLPlus` / `ULL` / `CPCSketch` / `OffHeap` | dict-id fast path | | `BaseDistinctCountSmartSketch` (`SmartHLL` / `SmartHLLPlus` / `SmartULL`) | dict-id fast path | | `SegmentPartitionedDistinctCount` | dict-id fast path | | `ModeAggregationFunction` | aggregate + SV/MV group-by | | `AnyValueAggregationFunction` | dict-id fast path | | `FUNNELCOUNT/AggregationStrategy` | precondition now correctly rejects RAW + dict columns with a friendly message instead of `UnsupportedOperationException` | ## Test plan - New tests in [`RawForwardIndexWithDictionaryTest`](pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/RawForwardIndexWithDictionaryTest.java) reproduce the crash for the representative paths: - `testMultiColumnGroupByWithRawDictColumnReturnsSameResults` - `testMultiColumnDistinctWithRawDictColumnReturnsSameResults` - `testDistinctWithFilterOnRawDictColumnReturnsSameResults` - `testDistinctCountWithFilterOnRawDictColumnReturnsSameResults` - `testDistinctCountHLLWithFilterOnRawDictColumnReturnsSameResults` - `testDistinctCountBitmapOnRawDictColumnReturnsSameResults` - `testSegmentPartitionedDistinctCountWithFilterOnRawDictColumnReturnsSameResults` - `testModeOnRawDictColumnReturnsSameResults` - All 16 runs (V1 + V2 engines) fail on master and pass with this change. - Filtered variants use a partial `WHERE` to bypass `NonScanBasedAggregationOperator`, which would otherwise serve the aggregation from the dictionary directly and hide the bug. - Existing regression test suite stays green: - 1566 unit tests across `*Distinct*` / `*Aggregation*` / `*GroupBy*` patterns - `DictionaryBasedGroupKeyGeneratorTest`, `NoDictionaryGroupKeyGeneratorTest`, `DistinctQueriesTest`, `DefaultAggregationExecutorTest` - Spotless, checkstyle, license — all clean on `pinot-core` and `pinot-integration-tests`. ## Backward compatibility Server-internal query-execution change only. No SPI surface change (`BlockValSet` lives in `pinot-core/common`, not in `pinot-spi`; the new method is a `default`). No config keys, wire formats, segment versions, REST endpoints, or DataTable formats are touched. Queries that previously threw `UnsupportedOperationException` now succeed via the value path; queries that previously succeeded keep their behavior. 🤖 Generated with [Claude Code](https://claude.com/claude-code) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
