Re: [PR] Auto-create dictionary when secondary indexes need it (inverted, FST, range) [pinot]

via GitHub Tue, 05 May 2026 13:16:52 -0700


xiangfu0 commented on code in PR #17269:
URL: https://github.com/apache/pinot/pull/17269#discussion_r3191286944



##########
pinot-core/src/main/java/org/apache/pinot/core/common/DataFetcher.java:
##########
@@ -345,13 +355,20 @@ private ForwardIndexReaderContext getReaderContext() {
 
     void readDictIds(int[] docIds, int length, int[] dictIdBuffer) {
       Tracing.activeRecording().setInputDataType(_storedType, _singleValue);
-      _reader.readDictIds(docIds, length, dictIdBuffer, getReaderContext());
+      ForwardIndexReaderContext readerContext = getReaderContext();
+      if (_useDictionary) {
+        _reader.readDictIds(docIds, length, dictIdBuffer, readerContext);
+      } else {
+        Preconditions.checkState(_dictionary != null, "Dictionary must be 
present for raw forward index");
+        Preconditions.checkState(_singleValue, "readDictIds(int[]) is 
single-value only; use readDictIdsMV for MV");
+        readDictIdsFromRawValues(docIds, length, dictIdBuffer, readerContext);

Review Comment:
   Agreed — switched to "explicit only" in 1b45308cc6 (and follow-up commit). 
The implicit fallback is gone: `readDictIds` / `readDictIdsMV` now throw 
`UnsupportedOperationException` on a RAW forward index instead of silently 
doing per-row dictionary lookups. Callers that genuinely need dict IDs on a 
shared-dict + RAW column must opt in explicitly via the new 
`DataFetcher#fetchDictIdsFromRawValues(...)` API (SV + MV), whose Javadoc flags 
it as `O(numDocs * Dictionary#indexOf)`.
   
   To make sure the throw doesn't break existing query paths, I also added 
`BlockValSet#isDictionaryEncoded()` (intentionally non-default so every impl 
has to declare its answer) and `ColumnContext#isDictionaryEncoded()`, then 
flipped every operator that previously gated on `dictionary != null` to gate on 
`isDictionaryEncoded()` instead — so for shared-dict + RAW columns these all 
route to the raw-value path:
   - `DefaultGroupByExecutor`, `NoDictionaryMultiColumnGroupKeyGenerator`
   - `DistinctExecutorFactory` (single- and multi-column DISTINCT)
   - `IdentifierTransformFunction` (its `TransformResultMetadata.hasDictionary` 
now reflects `isDictionaryEncoded()`)
   - All 
`DistinctCount[Bitmap/HLL/HLL+/SmartHLL/SmartHLL+/ULL/SmartULL/CPCSketch/OffHeap]`,
 `BaseDistinctAggregate`, `BaseDistinctCountSmartSketch`, 
`SegmentPartitionedDistinctCount`, `AnyValue`, `Mode`
   
   No production caller currently invokes `fetchDictIdsFromRawValues` — the API 
is reserved for cases where the operator needs exact dict-id semantics and has 
no correct raw-value alternative. The clearest future candidate is 
`DistinctCountBitmap` for non-INT types (its raw-value fallback uses 
`value.hashCode()` and is approximate due to collisions; the dict-id path is 
exact). Happy to wire that in a follow-up if you'd like, since it's an 
exactness-vs-cost trade and worth its own PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Auto-create dictionary when secondary indexes need it (inverted, FST, range) [pinot]

Reply via email to