Re: [PR] Auto-create dictionary when secondary indexes need it (inverted, FST, range) [pinot]

via GitHub Wed, 06 May 2026 19:03:14 -0700


Jackie-Jiang commented on code in PR #17269:
URL: https://github.com/apache/pinot/pull/17269#discussion_r3198463880



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/PredicateEvaluatorProvider.java:
##########


Review Comment:
   Should we make this private? We should always do 
`getDictionaryUsableForFiltering()` to decide whether to propagate dictionary



##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/groupby/DefaultGroupByExecutor.java:
##########
@@ -87,7 +87,9 @@ public DefaultGroupByExecutor(QueryContext queryContext, 
AggregationFunction[] a
     for (ExpressionContext groupByExpression : groupByExpressions) {
       ColumnContext columnContext = 
projectOperator.getResultColumnContext(groupByExpression);
       hasMVGroupByExpression |= !columnContext.isSingleValue();
-      hasNoDictionaryGroupByExpression |= columnContext.getDictionary() == 
null;
+      hasNoDictionaryGroupByExpression |= columnContext.getDictionary() == null
+          || (columnContext.getDataSource() != null

Review Comment:
   (minor) Could datasource ever be null?



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/PredicateEvaluatorProvider.java:
##########
@@ -97,7 +100,70 @@ public static PredicateEvaluator 
getPredicateEvaluator(Predicate predicate, @Nul
 
   public static PredicateEvaluator getPredicateEvaluator(Predicate predicate, 
DataSource dataSource,
       QueryContext queryContext) {
-    return getPredicateEvaluator(predicate, dataSource.getDictionary(),
+    return getPredicateEvaluator(predicate, 
getDictionaryUsableForFiltering(dataSource, queryContext, predicate),
         dataSource.getDataSourceMetadata().getDataType(), queryContext);
   }
+
+  /// Returns the column dictionary if the planner can actually use it for 
filtering this specific predicate, otherwise
+  /// `null`.
+  ///
+  /// When the forward index is RAW, scan-based filtering reads raw values 
rather than dict IDs, so a dict-based
+  /// predicate evaluator (which only implements `applySV(int dictId)`) cannot 
be applied during scan. The
+  /// dictionary is only useful when the planner will pick a dict-consuming 
filter operator AND that operator is
+  /// enabled for this query. The decision must be made per-predicate-type 
because each predicate type uses a
+  /// different subset of dict-consuming operators (see 
`FilterOperatorUtils.getLeafFilterOperator`):
+  ///
+  /// - `RANGE`: sorted-index and range-index paths consume the dictionary. 
The range-index format is chosen at
+  ///   segment build time based on whether a dictionary exists 
(`RangeIndexType#createIndexCreator`) — when a
+  ///   dictionary is present, the range index stores dict IDs, even if the 
forward index is RAW.
+  /// - `REGEXP_LIKE`: sorted and inverted paths consume the dictionary (when 
a dict-id-based regex evaluator is
+  ///   built). FST/IFST handle their own evaluators upstream of this method.
+  /// - `EQ`: sorted, inverted, and exact-range paths consume the dictionary. 
Range-index serves EQ only when
+  ///   `isExact()` is true (see `RangeIndexBasedFilterOperator#canEvaluate`).
+  /// - `NOT_EQ / IN / NOT_IN`: only sorted and inverted paths consume the 
dictionary.
+  ///
+  /// If the forward index itself is missing (explicitly disabled), scan is 
impossible and dict-based eval is the
+  /// only option, so the dictionary is preserved regardless of predicate type.
+  @Nullable
+  public static Dictionary getDictionaryUsableForFiltering(DataSource 
dataSource, @Nullable QueryContext queryContext,
+      Predicate predicate) {
+    Dictionary dictionary = dataSource.getDictionary();
+    if (dictionary == null) {
+      return null;
+    }
+    ForwardIndexReader<?> forwardIndex = dataSource.getForwardIndex();
+    if (forwardIndex == null || forwardIndex.isDictionaryEncoded()) {
+      return dictionary;
+    }
+    // RAW forward index: keep the dictionary only if a dict-consuming filter 
operator is available AND enabled for
+    // this predicate type.
+    boolean sortedAvailable = dataSource.getDataSourceMetadata().isSorted()
+        && isIndexAllowedForQuery(queryContext, dataSource, 
FieldConfig.IndexType.SORTED);
+    boolean invertedAvailable = dataSource.getInvertedIndex() != null
+        && isIndexAllowedForQuery(queryContext, dataSource, 
FieldConfig.IndexType.INVERTED);
+    RangeIndexReader<?> rangeIndex = dataSource.getRangeIndex();
+    boolean rangeAvailable = rangeIndex != null
+        && isIndexAllowedForQuery(queryContext, dataSource, 
FieldConfig.IndexType.RANGE);
+    switch (predicate.getType()) {
+      case RANGE:
+        // Range index always serves RANGE predicates; when a dictionary 
exists the range index is built over dict
+        // IDs and requires the dict-based evaluator.
+        return (sortedAvailable || rangeAvailable) ? dictionary : null;

Review Comment:
   We need to ensure the predicate can solely be resolved with range index. 
Mixing dictionary encoded range index and raw forward index scan will break



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/PredicateEvaluatorProvider.java:
##########


Review Comment:
   Double check all callers of this, make sure it is not using forward index.
   Also double check the behavior of `TransformFunction.getDictionary()`. Is it 
dependent on whether forward index is dictionary encoded?



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/PredicateEvaluatorProvider.java:
##########
@@ -97,7 +100,70 @@ public static PredicateEvaluator 
getPredicateEvaluator(Predicate predicate, @Nul
 
   public static PredicateEvaluator getPredicateEvaluator(Predicate predicate, 
DataSource dataSource,
       QueryContext queryContext) {
-    return getPredicateEvaluator(predicate, dataSource.getDictionary(),
+    return getPredicateEvaluator(predicate, 
getDictionaryUsableForFiltering(dataSource, queryContext, predicate),
         dataSource.getDataSourceMetadata().getDataType(), queryContext);
   }
+
+  /// Returns the column dictionary if the planner can actually use it for 
filtering this specific predicate, otherwise
+  /// `null`.
+  ///
+  /// When the forward index is RAW, scan-based filtering reads raw values 
rather than dict IDs, so a dict-based
+  /// predicate evaluator (which only implements `applySV(int dictId)`) cannot 
be applied during scan. The
+  /// dictionary is only useful when the planner will pick a dict-consuming 
filter operator AND that operator is
+  /// enabled for this query. The decision must be made per-predicate-type 
because each predicate type uses a
+  /// different subset of dict-consuming operators (see 
`FilterOperatorUtils.getLeafFilterOperator`):
+  ///
+  /// - `RANGE`: sorted-index and range-index paths consume the dictionary. 
The range-index format is chosen at
+  ///   segment build time based on whether a dictionary exists 
(`RangeIndexType#createIndexCreator`) — when a
+  ///   dictionary is present, the range index stores dict IDs, even if the 
forward index is RAW.
+  /// - `REGEXP_LIKE`: sorted and inverted paths consume the dictionary (when 
a dict-id-based regex evaluator is
+  ///   built). FST/IFST handle their own evaluators upstream of this method.
+  /// - `EQ`: sorted, inverted, and exact-range paths consume the dictionary. 
Range-index serves EQ only when
+  ///   `isExact()` is true (see `RangeIndexBasedFilterOperator#canEvaluate`).
+  /// - `NOT_EQ / IN / NOT_IN`: only sorted and inverted paths consume the 
dictionary.
+  ///
+  /// If the forward index itself is missing (explicitly disabled), scan is 
impossible and dict-based eval is the
+  /// only option, so the dictionary is preserved regardless of predicate type.
+  @Nullable
+  public static Dictionary getDictionaryUsableForFiltering(DataSource 
dataSource, @Nullable QueryContext queryContext,

Review Comment:
   (minor) Should we make it package private and annotate with 
`@VisibleForTesting`?



##########
pinot-core/src/test/java/org/apache/pinot/core/data/manager/TableIndexingTest.java:
##########
@@ -305,6 +308,9 @@ public void testAddIndex(TestCase testCase) {
                 ]
              */
           indexTypes.add(FieldConfig.IndexType.FST);
+          if (encoding == FieldConfig.EncodingType.RAW) {

Review Comment:
   Do we need this? Dictionary should be generated on FST without explicit 
configuration



##########
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/predicate/PredicateEvaluatorProvider.java:
##########
@@ -97,7 +100,70 @@ public static PredicateEvaluator 
getPredicateEvaluator(Predicate predicate, @Nul
 
   public static PredicateEvaluator getPredicateEvaluator(Predicate predicate, 
DataSource dataSource,
       QueryContext queryContext) {
-    return getPredicateEvaluator(predicate, dataSource.getDictionary(),
+    return getPredicateEvaluator(predicate, 
getDictionaryUsableForFiltering(dataSource, queryContext, predicate),
         dataSource.getDataSourceMetadata().getDataType(), queryContext);
   }
+
+  /// Returns the column dictionary if the planner can actually use it for 
filtering this specific predicate, otherwise
+  /// `null`.
+  ///
+  /// When the forward index is RAW, scan-based filtering reads raw values 
rather than dict IDs, so a dict-based
+  /// predicate evaluator (which only implements `applySV(int dictId)`) cannot 
be applied during scan. The
+  /// dictionary is only useful when the planner will pick a dict-consuming 
filter operator AND that operator is
+  /// enabled for this query. The decision must be made per-predicate-type 
because each predicate type uses a
+  /// different subset of dict-consuming operators (see 
`FilterOperatorUtils.getLeafFilterOperator`):
+  ///
+  /// - `RANGE`: sorted-index and range-index paths consume the dictionary. 
The range-index format is chosen at
+  ///   segment build time based on whether a dictionary exists 
(`RangeIndexType#createIndexCreator`) — when a
+  ///   dictionary is present, the range index stores dict IDs, even if the 
forward index is RAW.
+  /// - `REGEXP_LIKE`: sorted and inverted paths consume the dictionary (when 
a dict-id-based regex evaluator is
+  ///   built). FST/IFST handle their own evaluators upstream of this method.
+  /// - `EQ`: sorted, inverted, and exact-range paths consume the dictionary. 
Range-index serves EQ only when
+  ///   `isExact()` is true (see `RangeIndexBasedFilterOperator#canEvaluate`).
+  /// - `NOT_EQ / IN / NOT_IN`: only sorted and inverted paths consume the 
dictionary.
+  ///
+  /// If the forward index itself is missing (explicitly disabled), scan is 
impossible and dict-based eval is the
+  /// only option, so the dictionary is preserved regardless of predicate type.
+  @Nullable
+  public static Dictionary getDictionaryUsableForFiltering(DataSource 
dataSource, @Nullable QueryContext queryContext,
+      Predicate predicate) {
+    Dictionary dictionary = dataSource.getDictionary();
+    if (dictionary == null) {
+      return null;
+    }
+    ForwardIndexReader<?> forwardIndex = dataSource.getForwardIndex();
+    if (forwardIndex == null || forwardIndex.isDictionaryEncoded()) {
+      return dictionary;
+    }
+    // RAW forward index: keep the dictionary only if a dict-consuming filter 
operator is available AND enabled for
+    // this predicate type.
+    boolean sortedAvailable = dataSource.getDataSourceMetadata().isSorted()

Review Comment:
   This is dead code. Sorted index is run-length dictionary encoded forward 
index.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/BaseSegmentCreator.java:
##########
@@ -370,10 +370,23 @@ private boolean 
createDictionaryForColumn(ColumnStatistics stats, SegmentGenerat
 
     String column = spec.getName();
     FieldIndexConfigs fieldIndexConfigs = 
config.getIndexConfigsByColName().get(column);
+    // Infer dictionary existence based on index
+    boolean dictionaryRequired = 
DictionaryIndexConfig.requiresDictionary(spec, fieldIndexConfigs);
     if 
(fieldIndexConfigs.getConfig(StandardIndexes.dictionary()).isDisabled()) {
+      if (dictionaryRequired) {
+        LOGGER.warn("Found indexes {} require dictionary, but dictionary is 
disabled explicitly.",

Review Comment:
   Will this cause failure during segment creation?
   Should we always create dictionary when any index requires it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Auto-create dictionary when secondary indexes need it (inverted, FST, range) [pinot]

Reply via email to