somandal commented on code in PR #10557:
URL: https://github.com/apache/pinot/pull/10557#discussion_r1159213930


##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/IndexingConfig.java:
##########
@@ -31,6 +31,7 @@ public class IndexingConfig extends BaseJsonConfig {
    * This should be equal to the one specified in RangeIndexType.
    */
   private static final int DEFAULT_RANGE_INDEX_VERSION = 2;
+  public static final double DEFAULT_NO_DICTIONARY_SIZE_RATIO_THRESHOLD = 
0.85d;

Review Comment:
   nit: can you move this line above the `DEFAULT_RANGE_INDEX_VERSION` (since 
this is public and all the others are private)



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java:
##########
@@ -191,6 +192,66 @@ public static boolean 
shouldUseVarLengthDictionary(FieldSpec.DataType columnStor
     return false;
   }
 
+  /**
+   * This function evaluates whether to override dictionary (i.e use 
noDictionary)
+   * for a column even when its explicitly configured. This evaluation is for 
both dimension and metric
+   * column types.
+   * @return
+   */
+  public static boolean ignoreDictionaryOverride(boolean optimizeDictionary,
+      boolean optimizeDictionaryForMetrics, double 
noDictionarySizeRatioThreshold,
+      FieldSpec fieldSpec, FieldIndexConfigs fieldIndexConfigs, int 
colCardinality,
+      int totalNumberOfEntries) {
+    if (optimizeDictionary) {
+      // Do not create dictionaries for json or text index columns as they are 
high-cardinality values almost always
+      if ((fieldIndexConfigs.getConfig(StandardIndexes.json()).isEnabled() || 
fieldIndexConfigs.getConfig(
+          StandardIndexes.text()).isEnabled())) {
+        return false;
+      }
+      // Do not create dictionary if index size with dictionary is going to be 
larger than index size without dictionary
+      // This is done to reduce the cost of dictionary for high cardinality 
columns
+      // Off by default and needs optimizeDictionary to be set to true
+      if (fieldSpec.isSingleValueField() && 
fieldSpec.getDataType().isFixedWidth()) {
+        // if you can safely enable dictionary, you can ignore overrides
+        return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+            noDictionarySizeRatioThreshold, fieldSpec);
+      }
+    }
+
+    if (optimizeDictionaryForMetrics && !optimizeDictionary) {
+      if (fieldSpec.isSingleValueField() && 
fieldSpec.getDataType().isFixedWidth() && fieldSpec.getFieldType()
+          == FieldSpec.FieldType.METRIC) {
+        // if you can safely enable dictionary, you can ignore overrides
+        return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+            noDictionarySizeRatioThreshold, fieldSpec);
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Given the column cardinality, totalNumberOfEntries, this function checks 
if the savings ratio
+   * is larger than the configured threshold (noDictionarySizeRatioThreshold). 
If savings ratio is
+   * smaller than the threshold, we want to override to noDictionary.
+   * @return
+   */
+  private static boolean canSafelyEnableDictionary(int colCardinality, int 
totalNumberOfEntries,

Review Comment:
   nit: rename colCardinality -> cardinality



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java:
##########
@@ -191,6 +192,66 @@ public static boolean 
shouldUseVarLengthDictionary(FieldSpec.DataType columnStor
     return false;
   }
 
+  /**
+   * This function evaluates whether to override dictionary (i.e use 
noDictionary)
+   * for a column even when its explicitly configured. This evaluation is for 
both dimension and metric
+   * column types.
+   * @return
+   */
+  public static boolean ignoreDictionaryOverride(boolean optimizeDictionary,
+      boolean optimizeDictionaryForMetrics, double 
noDictionarySizeRatioThreshold,
+      FieldSpec fieldSpec, FieldIndexConfigs fieldIndexConfigs, int 
colCardinality,
+      int totalNumberOfEntries) {
+    if (optimizeDictionary) {
+      // Do not create dictionaries for json or text index columns as they are 
high-cardinality values almost always
+      if ((fieldIndexConfigs.getConfig(StandardIndexes.json()).isEnabled() || 
fieldIndexConfigs.getConfig(
+          StandardIndexes.text()).isEnabled())) {
+        return false;
+      }
+      // Do not create dictionary if index size with dictionary is going to be 
larger than index size without dictionary
+      // This is done to reduce the cost of dictionary for high cardinality 
columns
+      // Off by default and needs optimizeDictionary to be set to true
+      if (fieldSpec.isSingleValueField() && 
fieldSpec.getDataType().isFixedWidth()) {
+        // if you can safely enable dictionary, you can ignore overrides
+        return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+            noDictionarySizeRatioThreshold, fieldSpec);
+      }
+    }
+
+    if (optimizeDictionaryForMetrics && !optimizeDictionary) {
+      if (fieldSpec.isSingleValueField() && 
fieldSpec.getDataType().isFixedWidth() && fieldSpec.getFieldType()
+          == FieldSpec.FieldType.METRIC) {
+        // if you can safely enable dictionary, you can ignore overrides
+        return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+            noDictionarySizeRatioThreshold, fieldSpec);
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Given the column cardinality, totalNumberOfEntries, this function checks 
if the savings ratio
+   * is larger than the configured threshold (noDictionarySizeRatioThreshold). 
If savings ratio is
+   * smaller than the threshold, we want to override to noDictionary.
+   * @return

Review Comment:
   nit: remove empty @result



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java:
##########
@@ -191,6 +192,66 @@ public static boolean 
shouldUseVarLengthDictionary(FieldSpec.DataType columnStor
     return false;
   }
 
+  /**
+   * This function evaluates whether to override dictionary (i.e use 
noDictionary)
+   * for a column even when its explicitly configured. This evaluation is for 
both dimension and metric
+   * column types.
+   * @return
+   */
+  public static boolean ignoreDictionaryOverride(boolean optimizeDictionary,
+      boolean optimizeDictionaryForMetrics, double 
noDictionarySizeRatioThreshold,
+      FieldSpec fieldSpec, FieldIndexConfigs fieldIndexConfigs, int 
colCardinality,
+      int totalNumberOfEntries) {
+    if (optimizeDictionary) {
+      // Do not create dictionaries for json or text index columns as they are 
high-cardinality values almost always
+      if ((fieldIndexConfigs.getConfig(StandardIndexes.json()).isEnabled() || 
fieldIndexConfigs.getConfig(
+          StandardIndexes.text()).isEnabled())) {
+        return false;
+      }
+      // Do not create dictionary if index size with dictionary is going to be 
larger than index size without dictionary
+      // This is done to reduce the cost of dictionary for high cardinality 
columns
+      // Off by default and needs optimizeDictionary to be set to true
+      if (fieldSpec.isSingleValueField() && 
fieldSpec.getDataType().isFixedWidth()) {
+        // if you can safely enable dictionary, you can ignore overrides
+        return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+            noDictionarySizeRatioThreshold, fieldSpec);
+      }
+    }
+
+    if (optimizeDictionaryForMetrics && !optimizeDictionary) {
+      if (fieldSpec.isSingleValueField() && 
fieldSpec.getDataType().isFixedWidth() && fieldSpec.getFieldType()
+          == FieldSpec.FieldType.METRIC) {
+        // if you can safely enable dictionary, you can ignore overrides
+        return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+            noDictionarySizeRatioThreshold, fieldSpec);
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Given the column cardinality, totalNumberOfEntries, this function checks 
if the savings ratio
+   * is larger than the configured threshold (noDictionarySizeRatioThreshold). 
If savings ratio is
+   * smaller than the threshold, we want to override to noDictionary.
+   * @return
+   */
+  private static boolean canSafelyEnableDictionary(int colCardinality, int 
totalNumberOfEntries,

Review Comment:
   nit: can we use the original function name here or use something like: 
`canSafelyCreateDictionaryWithinThreshold` to indicate something about the 
threshold checks.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java:
##########
@@ -283,47 +283,16 @@ private boolean 
createDictionaryForColumn(ColumnIndexCreationInfo info, SegmentG
       return false;
     }
 
-    if (config.isOptimizeDictionary()) {
-
-      FieldIndexConfigs fieldIndexConfigs = 
config.getIndexConfigsByColName().get(column);
-      // Do not create dictionaries for json or text index columns as they are 
high-cardinality values almost always
-      if ((fieldIndexConfigs.getConfig(StandardIndexes.json()).isEnabled()
-          || fieldIndexConfigs.getConfig(StandardIndexes.text()).isEnabled())) 
{
-        return false;
-      }
-
-      // Do not create dictionary if index size with dictionary is going to be 
larger than index size without dictionary
-      // This is done to reduce the cost of dictionary for high cardinality 
columns
-      // Off by default and needs optimizeDictionary to be set to true
-      if (spec.isSingleValueField() && spec.getDataType().isFixedWidth()) {
-        return shouldCreateDictionaryWithinThreshold(info, config, spec);
-      }
-    }
-
-    if (config.isOptimizeDictionaryForMetrics() && 
!config.isOptimizeDictionary()) {
-      if (spec.isSingleValueField() && spec.getDataType().isFixedWidth() && 
spec.getFieldType() == FieldType.METRIC) {
-        return shouldCreateDictionaryWithinThreshold(info, config, spec);
-      }
-    }
-
-    return info.isCreateDictionary();
-  }
-
-  private boolean 
shouldCreateDictionaryWithinThreshold(ColumnIndexCreationInfo info,
-      SegmentGeneratorConfig config, FieldSpec spec) {
-    long dictionarySize = info.getDistinctValueCount() * (long) 
spec.getDataType().size();
-    long forwardIndexSize =
-        ((long) info.getTotalNumberOfEntries()
-            * PinotDataBitSet.getNumBitsPerValue(info.getDistinctValueCount() 
- 1) + Byte.SIZE - 1) / Byte.SIZE;
-
-    double indexWithDictSize = dictionarySize + forwardIndexSize;
-    double indexWithoutDictSize = info.getTotalNumberOfEntries() * 
spec.getDataType().size();
-
-    double indexSizeRatio = indexWithoutDictSize / indexWithDictSize;
-    if (indexSizeRatio <= config.getNoDictionarySizeRatioThreshold()) {
+    FieldIndexConfigs fieldIndexConfigs = 
config.getIndexConfigsByColName().get(column);

Review Comment:
   nit: for easier readability, how about you set a variable defaulted to false 
and return that after the if condition?
   
   e.g. 
   
   ```
   boolean shouldCreateDictionary = false;
   if 
(DictionaryIndexType.ignoreDictionaryOverride(config.isOptimizeDictionary(),
           config.isOptimizeDictionaryForMetrics(), 
config.getNoDictionarySizeRatioThreshold(), spec,
           fieldIndexConfigs, info.getDistinctValueCount(), 
info.getTotalNumberOfEntries())) {
       // Ignore overrides and pick from config
       shouldCreateDictionary = info.isCreateDictionary();
   }
   return shouldCreateDictionary;
   ```



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java:
##########
@@ -191,6 +192,66 @@ public static boolean 
shouldUseVarLengthDictionary(FieldSpec.DataType columnStor
     return false;
   }
 
+  /**
+   * This function evaluates whether to override dictionary (i.e use 
noDictionary)
+   * for a column even when its explicitly configured. This evaluation is for 
both dimension and metric
+   * column types.
+   * @return

Review Comment:
   nit: remove this empty @return



##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandlerTest.java:
##########
@@ -2239,6 +2243,47 @@ public void 
testAddOtherIndexWhenForwardIndexDisabledAndInvertedIndexOrDictionar
     }
   }
 
+  /**
+   * Tests to verify various combinations of inputs to test dictionary 
override optimization.
+   */
+  @Test
+  public void testDictionaryOverride() {
+    FieldSpec fieldSpec = new MetricFieldSpec();
+    fieldSpec.setName("test");
+    fieldSpec.setDataType(FieldSpec.DataType.DOUBLE);
+    IndexType index1 = Mockito.mock(IndexType.class);
+    Mockito.when(index1.getId()).thenReturn("index1");
+    IndexConfig indexConf = new IndexConfig(true);
+    FieldIndexConfigs fieldIndexConfigs = new FieldIndexConfigs.Builder()
+        .add(index1, indexConf)
+        .build();
+    // No need to disable dictionary
+    boolean result = DictionaryIndexType.ignoreDictionaryOverride(false, true,
+        2, fieldSpec,
+        fieldIndexConfigs, 5, 20);
+    Assert.assertEquals(result, true);
+
+    // Set a higher noDictionarySizeRatioThreshold
+    result = DictionaryIndexType.ignoreDictionaryOverride(false, true,
+        5, fieldSpec,
+        fieldIndexConfigs, 5, 20);
+    Assert.assertEquals(result, false);
+
+    // optimizeDictionary and optimizeDictionaryForMetrics both turned on
+    result = DictionaryIndexType.ignoreDictionaryOverride(true, true,
+        5, fieldSpec,
+        fieldIndexConfigs, 5, 20);
+    Assert.assertEquals(result, false);
+
+    // Don't ignore for Json. We want to disable dictionary for json.
+    fieldSpec = new DimensionFieldSpec();
+    fieldSpec.setName("test");
+    fieldSpec.setDataType(FieldSpec.DataType.JSON);
+    result = DictionaryIndexType.ignoreDictionaryOverride(true, true,
+        5, fieldSpec,
+        fieldIndexConfigs, 5, 20);
+    Assert.assertEquals(result, true);
+  }

Review Comment:
   nit: please add an empty space after this line



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandler.java:
##########
@@ -309,8 +309,16 @@ Map<String, List<Operation>> 
computeOperations(SegmentDirectory.Reader segmentRe
           LOGGER.warn("Cannot enable dictionary for column={} as schema or 
tableConfig is null.", column);
           continue;
         }
-
-        columnOperationsMap.put(column, 
Collections.singletonList(Operation.ENABLE_DICTIONARY));
+        ColumnMetadata existingColumnMetadata = 
_segmentDirectory.getSegmentMetadata().getColumnMetadataFor(column);

Review Comment:
   Looks like even for `forwardIndexDisabled`, the optimize configs can be set. 
Can you either add this check for the `forwardIndexDisabled` code paths as 
well, or add some code to the 
`TableConfigUtils::validateForwardIndexDisabledIndexCompatibility()` to 
disallow setting the optimizer configs when the forward index is disabled. I 
think the latter may make more sense since if the forward index is disabled 
then there is no point in applying this optimization and there are other 
complications with not having a dictionary.
   
   I'd make this decision based on whether the main reason for these optimize 
configs is to keep the forward index small or whether to avoid the dictionary 
creation. I believe it's for the former, but confirm with @KKcorps as well.
   
   Please add tests for the scenario as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to