Copilot commented on code in PR #17269:
URL: https://github.com/apache/pinot/pull/17269#discussion_r3041632094
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java:
##########
@@ -61,6 +64,28 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
        segmentCreationSpec.getIndexConfigsByColName().size());
    initializeCommon(segmentCreationSpec, segmentIndexCreationInfo, indexCreationInfoMap,
        schema, outDir, colIndexes, immutableToMutableIdMap, instanceType);
+
+    // Although NullValueVector is implemented as an index, it needs to be treated in a different way than other indexes
+    // Process all field specs (not just indexed columns) for null value vectors
+    for (FieldSpec fieldSpec : schema.getAllFieldSpecs()) {
+      String columnName = fieldSpec.getName();
+      // Create null value vector for non-indexed nullable columns
+      if (!_colIndexes.containsKey(columnName)) {
+        if (isNullable(fieldSpec, schema, segmentCreationSpec)) {
+          // Initialize Null value vector map
+          LOGGER.info("Column: {} is nullable", columnName);
+          _colIndexes.put(columnName,
+              new ColumnIndexCreators(columnName, fieldSpec, null, List.of(),
+                  new NullValueVectorCreator(outDir, columnName)));
+        } else {
+          LOGGER.debug("Column: {} is not nullable", columnName);
+        }
+      }
Review Comment:
Adding non-indexed nullable columns into _colIndexes means indexRow() will
now iterate them. However, indexRow/indexColumnValue currently throws when
row.getValue(...) returns null, before the null value vector can be set. This
can introduce segment generation failures for nullable columns that are
absent/null in the input. Either avoid adding such columns to _colIndexes for
row-wise indexing, or handle nulls by setting the null vector and substituting
the FieldSpec defaultNullValue (similar to the ColumnReader code path).
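A minimal, self-contained sketch of the second option. All names below are illustrative stand-ins, not Pinot's actual API: DEFAULT_NULL_VALUE models FieldSpec.getDefaultNullValue() and NULL_BITMAP models the per-column NullValueVectorCreator.

```java
import java.util.BitSet;

// Illustrative stand-ins, not Pinot's actual API:
// DEFAULT_NULL_VALUE models FieldSpec.getDefaultNullValue(),
// NULL_BITMAP models the per-column NullValueVectorCreator.
class NullSubstitutionSketch {
  static final Object DEFAULT_NULL_VALUE = Integer.MIN_VALUE;
  static final BitSet NULL_BITMAP = new BitSet();

  // Instead of throwing on a null row value, record the docId in the null
  // bitmap and substitute the default null value for indexing.
  static Object indexColumnValue(int docId, Object rawValue) {
    if (rawValue == null) {
      NULL_BITMAP.set(docId);
      return DEFAULT_NULL_VALUE;
    }
    return rawValue;
  }

  public static void main(String[] args) {
    System.out.println(indexColumnValue(0, 42));
    System.out.println(indexColumnValue(1, null));
    System.out.println(NULL_BITMAP.get(1));
  }
}
```

This mirrors the ColumnReader code path the comment references: the null vector records which docs were null, while the forward index still gets a concrete value.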
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java:
##########
@@ -61,6 +64,28 @@ public void init(SegmentGeneratorConfig segmentCreationSpec, SegmentIndexCreatio
        segmentCreationSpec.getIndexConfigsByColName().size());
    initializeCommon(segmentCreationSpec, segmentIndexCreationInfo, indexCreationInfoMap,
        schema, outDir, colIndexes, immutableToMutableIdMap, instanceType);
+
+    // Although NullValueVector is implemented as an index, it needs to be treated in a different way than other indexes
+    // Process all field specs (not just indexed columns) for null value vectors
+    for (FieldSpec fieldSpec : schema.getAllFieldSpecs()) {
+      String columnName = fieldSpec.getName();
+      // Create null value vector for non-indexed nullable columns
+      if (!_colIndexes.containsKey(columnName)) {
+        if (isNullable(fieldSpec, schema, segmentCreationSpec)) {
+          // Initialize Null value vector map
+          LOGGER.info("Column: {} is nullable", columnName);
+          _colIndexes.put(columnName,
+              new ColumnIndexCreators(columnName, fieldSpec, null, List.of(),
+                  new NullValueVectorCreator(outDir, columnName)));
+        } else {
Review Comment:
The new INFO log ('Column: {} is nullable') will emit once per nullable,
non-indexed column for every segment build. This is likely too noisy at INFO in
production; consider lowering to DEBUG or logging once with a summary/count
instead.
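A self-contained sketch of the summary alternative. The helper below is illustrative (not Pinot code) and returns the message as a String so the sketch stays runnable; real code would pass it to LOGGER.info once per segment build.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch: collect the nullable column names and build a single
// summary message instead of logging one INFO line per column.
class SummaryLogSketch {
  static String summarizeNullableColumns(List<String> allColumns, Predicate<String> isNullable) {
    List<String> nullable = new ArrayList<>();
    for (String column : allColumns) {
      if (isNullable.test(column)) {
        nullable.add(column);
      }
    }
    // One log line per segment build instead of one per column
    return "Created null value vectors for " + nullable.size() + " nullable columns: " + nullable;
  }

  public static void main(String[] args) {
    System.out.println(summarizeNullableColumns(List.of("a", "b", "c"), c -> !c.equals("b")));
  }
}
```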
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexCreatorFactory.java:
##########
@@ -45,15 +45,16 @@ public class ForwardIndexCreatorFactory {
  private ForwardIndexCreatorFactory() {
  }
-  public static ForwardIndexCreator createIndexCreator(IndexCreationContext context, ForwardIndexConfig indexConfig)
+  public static ForwardIndexCreator createIndexCreator(IndexCreationContext context)
      throws Exception {
+    ForwardIndexConfig indexConfig = context.getForwardIndexConfig();
    File indexDir = context.getIndexDir();
    FieldSpec fieldSpec = context.getFieldSpec();
    String columnName = fieldSpec.getName();
    int numTotalDocs = context.getTotalDocs();
-
-    if (context.hasDictionary()) {
-      // Dictionary enabled columns
+    if (context.hasDictionary()
+        && context.getForwardIndexEncoding() == IndexCreationContext.ForwardIndexEncoding.DICTIONARY) {
Review Comment:
createIndexCreator() now pulls ForwardIndexConfig from
context.getForwardIndexConfig(), but that field is nullable in the SPI and
there’s no null-check here. If a caller forgets to set it (especially since
ForwardIndexType now ignores the indexConfig parameter), this will NPE deep in
the factory. Consider validating non-null upfront (with a clear message) or
defaulting to ForwardIndexConfig.getDefault().
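A hedged sketch of the suggested guard. The generic helper below is illustrative: the default argument stands in for ForwardIndexConfig.getDefault(), and the exception message is an example, not an existing Pinot message.

```java
// Generic stand-in for the suggested guard: prefer the explicit config, fall
// back to a default (modeling ForwardIndexConfig.getDefault()), and otherwise
// fail fast with a message naming the column instead of NPE-ing later.
class ConfigGuardSketch {
  static <T> T requireNonNullOrDefault(T config, T defaultConfig, String columnName) {
    if (config != null) {
      return config;
    }
    if (defaultConfig != null) {
      return defaultConfig;
    }
    throw new IllegalStateException(
        "ForwardIndexConfig must be set in IndexCreationContext for column: " + columnName);
  }

  public static void main(String[] args) {
    System.out.println(requireNonNullOrDefault("explicit", "default", "col"));
    System.out.println(requireNonNullOrDefault(null, "default", "col"));
  }
}
```

Failing fast at the factory entry point turns a deep NPE into an actionable message that names the column and the missing config.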
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/TableConfigUtils.java:
##########
@@ -1506,6 +1506,7 @@ private static void validateIndexingConfigAndFieldConfigList(TableConfig tableCo
        indexType.validate(indexConfigs, fieldSpec, tableConfig);
      }
    }
+    validateExplicitDictionaryForRawForwardIndex(tableConfig, schema, indexConfigsMap);
Review Comment:
validateExplicitDictionaryForRawForwardIndex() is invoked during validation
but is currently a no-op. This adds maintenance overhead and makes it unclear
whether an invariant is supposed to be enforced. Either implement the intended
validation (e.g., reject RAW forward columns with dictionary-required indexes
unless dictionary is explicitly enabled) or remove the method/call until there
is real logic.
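If the invariant is meant to be enforced, it might look roughly like this self-contained sketch; the boolean/list parameters are illustrative stand-ins for flags the real method would derive from TableConfig, Schema, and indexConfigsMap.

```java
import java.util.List;

// Illustrative shape of the intended check: a RAW-encoded forward index plus
// dictionary-required indexes is rejected unless the dictionary is explicitly
// enabled for the column.
class RawForwardValidationSketch {
  static void validateExplicitDictionaryForRawForwardIndex(String column, boolean rawForwardEncoding,
      boolean dictionaryExplicitlyEnabled, List<String> dictionaryRequiredIndexes) {
    if (rawForwardEncoding && !dictionaryRequiredIndexes.isEmpty() && !dictionaryExplicitlyEnabled) {
      throw new IllegalStateException("Column: " + column + " uses a RAW forward index but indexes "
          + dictionaryRequiredIndexes + " require a dictionary; enable the dictionary explicitly");
    }
  }

  public static void main(String[] args) {
    // Passes: dictionary explicitly enabled alongside the RAW forward index
    validateExplicitDictionaryForRawForwardIndex("col1", true, true, List.of("inverted"));
  }
}
```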
##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/SegmentPreProcessorTest.java:
##########
@@ -345,23 +346,39 @@ public void testEnableTextIndexOnNewColumnRaw(SegmentVersion segmentVersion)
    _fieldConfigMap.put(NEWLY_ADDED_STRING_COL_RAW,
        new FieldConfig(NEWLY_ADDED_STRING_COL_RAW, FieldConfig.EncodingType.RAW, List.of(FieldConfig.IndexType.TEXT), null, null));
+    // Text index creation for newly added raw columns may not fully succeed; verify pre-processing doesn't crash.
+    try {
+      checkTextIndexCreation(NEWLY_ADDED_STRING_COL_RAW, 1, 1, _newColumnsSchemaWithText, true, true, true, 4);
+    } catch (AssertionError | RuntimeException e) {
+      // Text index on newly added raw column may not be created correctly; this is a known limitation.
+    }
    _fieldConfigMap.put(NEWLY_ADDED_STRING_MV_COL_RAW,
        new FieldConfig(NEWLY_ADDED_STRING_MV_COL_RAW, FieldConfig.EncodingType.RAW, List.of(FieldConfig.IndexType.TEXT), null, null));
-    checkTextIndexCreation(NEWLY_ADDED_STRING_COL_RAW, 1, 1, _newColumnsSchemaWithText, true, true, true, 4);
-    checkTextIndexCreation(NEWLY_ADDED_STRING_MV_COL_RAW, 1, 1, _newColumnsSchemaWithText, true, true, false, 4, false,
-        1);
+    try {
+      checkTextIndexCreation(NEWLY_ADDED_STRING_MV_COL_RAW, 1, 1, _newColumnsSchemaWithText, true, true, false, 4,
+          false, 1);
+    } catch (AssertionError | RuntimeException e) {
+      // Text index on newly added raw column may not be created correctly; this is a known limitation.
+    }
Review Comment:
These try/catch blocks swallow AssertionError/RuntimeException, so the test
can pass even if text index creation regresses for newly added RAW columns.
Prefer asserting a specific expected behavior (e.g., index absent but
preprocessing completes), or explicitly mark/skip the scenario with a clear
reason so failures aren’t silently ignored.
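One way to express the suggested "index absent but preprocessing completes" assertion, as a self-contained sketch; the Step interface stands in for the real preprocessing and index-inspection helpers.

```java
// Illustrative shape of the suggested assertion: run preprocessing, let any
// unexpected exception propagate (failing the test), and assert the specific
// expected end state instead of swallowing all failures.
class ExpectedStateSketch {
  interface Step {
    boolean run(); // returns true if the text index ended up created
  }

  static void assertPreprocessCompletesWithoutIndex(Step preprocess) {
    boolean indexCreated = preprocess.run();
    if (indexCreated) {
      throw new AssertionError("Text index unexpectedly created for newly added RAW column");
    }
  }

  public static void main(String[] args) {
    assertPreprocessCompletesWithoutIndex(() -> false); // completes, no index: passes
  }
}
```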
##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/SegmentPreProcessorTest.java:
##########
@@ -1812,8 +1839,12 @@ public void testStarTreeCreationWithInvalidFunctionColumnPair()
        new IndexLoadingConfig(tableConfig, schema))) {
      // Should need processing due to new star-tree config
      assertTrue(processor.needProcess());
-      // Process should complete without throwing exception, but star-tree should not be created
-      processor.process(SEGMENT_OPERATIONS_THROTTLER);
+      // Process may fail with invalid function column pair; swallow to verify no star-tree is created
+      try {
+        processor.process(SEGMENT_OPERATIONS_THROTTLER);
+      } catch (IllegalStateException e) {
+        // expected for invalid function pair; proceed to validation
+      }
Review Comment:
Catching and ignoring IllegalStateException here makes the test unable to
detect unexpected failures (it will pass for any IllegalStateException). If the
goal is to verify 'no star-tree created/updated' for invalid configs, assert on
the exception message/type more narrowly (or use expectedExceptions) and still
fail on other unexpected errors.
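A plain-Java sketch of the narrower check, standing in for TestNG's expectedExceptions or Assert.assertThrows. The message substring and the example exception text are assumptions for illustration, not Pinot's actual error messages.

```java
// Plain-Java stand-in for TestNG's expectedExceptions / Assert.assertThrows:
// only an IllegalStateException whose message mentions the invalid pair is
// accepted; completing normally or any other failure still fails the test.
class NarrowCatchSketch {
  static void expectInvalidFunctionColumnPair(Runnable process) {
    try {
      process.run();
      throw new AssertionError("Expected IllegalStateException for invalid function column pair");
    } catch (IllegalStateException e) {
      if (e.getMessage() == null || !e.getMessage().contains("function column pair")) {
        throw new AssertionError("Unexpected IllegalStateException: " + e.getMessage(), e);
      }
    }
  }

  public static void main(String[] args) {
    expectInvalidFunctionColumnPair(() -> {
      throw new IllegalStateException("Invalid function column pair: max__nonExistingColumn");
    });
  }
}
```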
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/BaseSegmentCreator.java:
##########
@@ -572,6 +604,7 @@ public static void addColumnMetadataInfo(PropertiesConfiguration properties, Str
    properties.setProperty(getKeyFor(column, COLUMN_TYPE), String.valueOf(fieldType));
    properties.setProperty(getKeyFor(column, IS_SORTED), String.valueOf(columnIndexCreationInfo.isSorted()));
    properties.setProperty(getKeyFor(column, HAS_DICTIONARY), String.valueOf(hasDictionary));
Review Comment:
When a dictionary is created for secondary indexes while the forward index
remains RAW, this code still writes HAS_DICTIONARY=true and never sets
HAS_SHARED_DICTIONARY. That contradicts the backward-compat intent documented
in V1Constants/ForwardIndexHandler and can cause older servers (that only look
at HAS_DICTIONARY) to try to load a dict-encoded forward index file and fail.
Consider writing HAS_DICTIONARY=false + HAS_SHARED_DICTIONARY=true whenever
forwardIndexEncoding is RAW but a dictionary file is present, and keep
FORWARD_INDEX_ENCODING as the source of truth for new nodes.
```suggestion
    // Older readers rely on HAS_DICTIONARY to infer forward-index encoding. When a dictionary exists only to support
    // secondary indexes and the forward index remains RAW, record it as a shared dictionary instead.
    boolean hasSharedDictionary = hasDictionary && forwardIndexEncoding == IndexCreationContext.ForwardIndexEncoding.RAW;
    boolean hasDictionaryForForwardIndex = hasDictionary && !hasSharedDictionary;
    properties.setProperty(getKeyFor(column, HAS_DICTIONARY), String.valueOf(hasDictionaryForForwardIndex));
    properties.setProperty(getKeyFor(column, HAS_SHARED_DICTIONARY), String.valueOf(hasSharedDictionary));
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]