somandal commented on code in PR #10557:
URL: https://github.com/apache/pinot/pull/10557#discussion_r1159213930
##########
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/IndexingConfig.java:
##########
@@ -31,6 +31,7 @@ public class IndexingConfig extends BaseJsonConfig {
* This should be equal to the one specified in RangeIndexType.
*/
private static final int DEFAULT_RANGE_INDEX_VERSION = 2;
+ public static final double DEFAULT_NO_DICTIONARY_SIZE_RATIO_THRESHOLD =
0.85d;
Review Comment:
nit: can you move this line above the `DEFAULT_RANGE_INDEX_VERSION` (since
this is public and all the others are private)
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java:
##########
@@ -191,6 +192,66 @@ public static boolean
shouldUseVarLengthDictionary(FieldSpec.DataType columnStor
return false;
}
+ /**
+ * This function evaluates whether to override dictionary (i.e use
noDictionary)
+ * for a column even when its explicitly configured. This evaluation is for
both dimension and metric
+ * column types.
+ * @return
+ */
+ public static boolean ignoreDictionaryOverride(boolean optimizeDictionary,
+ boolean optimizeDictionaryForMetrics, double
noDictionarySizeRatioThreshold,
+ FieldSpec fieldSpec, FieldIndexConfigs fieldIndexConfigs, int
colCardinality,
+ int totalNumberOfEntries) {
+ if (optimizeDictionary) {
+ // Do not create dictionaries for json or text index columns as they are
high-cardinality values almost always
+ if ((fieldIndexConfigs.getConfig(StandardIndexes.json()).isEnabled() ||
fieldIndexConfigs.getConfig(
+ StandardIndexes.text()).isEnabled())) {
+ return false;
+ }
+ // Do not create dictionary if index size with dictionary is going to be
larger than index size without dictionary
+ // This is done to reduce the cost of dictionary for high cardinality
columns
+ // Off by default and needs optimizeDictionary to be set to true
+ if (fieldSpec.isSingleValueField() &&
fieldSpec.getDataType().isFixedWidth()) {
+ // if you can safely enable dictionary, you can ignore overrides
+ return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+ noDictionarySizeRatioThreshold, fieldSpec);
+ }
+ }
+
+ if (optimizeDictionaryForMetrics && !optimizeDictionary) {
+ if (fieldSpec.isSingleValueField() &&
fieldSpec.getDataType().isFixedWidth() && fieldSpec.getFieldType()
+ == FieldSpec.FieldType.METRIC) {
+ // if you can safely enable dictionary, you can ignore overrides
+ return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+ noDictionarySizeRatioThreshold, fieldSpec);
+ }
+ }
+ return true;
+ }
+
+ /**
+ * Given the column cardinality, totalNumberOfEntries, this function checks
if the savings ratio
+ * is larger than the configured threshold (noDictionarySizeRatioThreshold).
If savings ratio is
+ * smaller than the threshold, we want to override to noDictionary.
+ * @return
+ */
+ private static boolean canSafelyEnableDictionary(int colCardinality, int
totalNumberOfEntries,
Review Comment:
nit: rename colCardinality -> cardinality
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java:
##########
@@ -191,6 +192,66 @@ public static boolean
shouldUseVarLengthDictionary(FieldSpec.DataType columnStor
return false;
}
+ /**
+ * This function evaluates whether to override dictionary (i.e use
noDictionary)
+ * for a column even when its explicitly configured. This evaluation is for
both dimension and metric
+ * column types.
+ * @return
+ */
+ public static boolean ignoreDictionaryOverride(boolean optimizeDictionary,
+ boolean optimizeDictionaryForMetrics, double
noDictionarySizeRatioThreshold,
+ FieldSpec fieldSpec, FieldIndexConfigs fieldIndexConfigs, int
colCardinality,
+ int totalNumberOfEntries) {
+ if (optimizeDictionary) {
+ // Do not create dictionaries for json or text index columns as they are
high-cardinality values almost always
+ if ((fieldIndexConfigs.getConfig(StandardIndexes.json()).isEnabled() ||
fieldIndexConfigs.getConfig(
+ StandardIndexes.text()).isEnabled())) {
+ return false;
+ }
+ // Do not create dictionary if index size with dictionary is going to be
larger than index size without dictionary
+ // This is done to reduce the cost of dictionary for high cardinality
columns
+ // Off by default and needs optimizeDictionary to be set to true
+ if (fieldSpec.isSingleValueField() &&
fieldSpec.getDataType().isFixedWidth()) {
+ // if you can safely enable dictionary, you can ignore overrides
+ return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+ noDictionarySizeRatioThreshold, fieldSpec);
+ }
+ }
+
+ if (optimizeDictionaryForMetrics && !optimizeDictionary) {
+ if (fieldSpec.isSingleValueField() &&
fieldSpec.getDataType().isFixedWidth() && fieldSpec.getFieldType()
+ == FieldSpec.FieldType.METRIC) {
+ // if you can safely enable dictionary, you can ignore overrides
+ return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+ noDictionarySizeRatioThreshold, fieldSpec);
+ }
+ }
+ return true;
+ }
+
+ /**
+ * Given the column cardinality, totalNumberOfEntries, this function checks
if the savings ratio
+ * is larger than the configured threshold (noDictionarySizeRatioThreshold).
If savings ratio is
+ * smaller than the threshold, we want to override to noDictionary.
+ * @return
Review Comment:
nit: remove empty @result
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java:
##########
@@ -191,6 +192,66 @@ public static boolean
shouldUseVarLengthDictionary(FieldSpec.DataType columnStor
return false;
}
+ /**
+ * This function evaluates whether to override dictionary (i.e use
noDictionary)
+ * for a column even when its explicitly configured. This evaluation is for
both dimension and metric
+ * column types.
+ * @return
+ */
+ public static boolean ignoreDictionaryOverride(boolean optimizeDictionary,
+ boolean optimizeDictionaryForMetrics, double
noDictionarySizeRatioThreshold,
+ FieldSpec fieldSpec, FieldIndexConfigs fieldIndexConfigs, int
colCardinality,
+ int totalNumberOfEntries) {
+ if (optimizeDictionary) {
+ // Do not create dictionaries for json or text index columns as they are
high-cardinality values almost always
+ if ((fieldIndexConfigs.getConfig(StandardIndexes.json()).isEnabled() ||
fieldIndexConfigs.getConfig(
+ StandardIndexes.text()).isEnabled())) {
+ return false;
+ }
+ // Do not create dictionary if index size with dictionary is going to be
larger than index size without dictionary
+ // This is done to reduce the cost of dictionary for high cardinality
columns
+ // Off by default and needs optimizeDictionary to be set to true
+ if (fieldSpec.isSingleValueField() &&
fieldSpec.getDataType().isFixedWidth()) {
+ // if you can safely enable dictionary, you can ignore overrides
+ return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+ noDictionarySizeRatioThreshold, fieldSpec);
+ }
+ }
+
+ if (optimizeDictionaryForMetrics && !optimizeDictionary) {
+ if (fieldSpec.isSingleValueField() &&
fieldSpec.getDataType().isFixedWidth() && fieldSpec.getFieldType()
+ == FieldSpec.FieldType.METRIC) {
+ // if you can safely enable dictionary, you can ignore overrides
+ return canSafelyEnableDictionary(colCardinality, totalNumberOfEntries,
+ noDictionarySizeRatioThreshold, fieldSpec);
+ }
+ }
+ return true;
+ }
+
+ /**
+ * Given the column cardinality, totalNumberOfEntries, this function checks
if the savings ratio
+ * is larger than the configured threshold (noDictionarySizeRatioThreshold).
If savings ratio is
+ * smaller than the threshold, we want to override to noDictionary.
+ * @return
+ */
+ private static boolean canSafelyEnableDictionary(int colCardinality, int
totalNumberOfEntries,
Review Comment:
nit: can we use the original function name here or use something like:
`canSafelyCreateDictionaryWithinThreshold` to indicate something about the
threshold checks.
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java:
##########
@@ -283,47 +283,16 @@ private boolean
createDictionaryForColumn(ColumnIndexCreationInfo info, SegmentG
return false;
}
- if (config.isOptimizeDictionary()) {
-
- FieldIndexConfigs fieldIndexConfigs =
config.getIndexConfigsByColName().get(column);
- // Do not create dictionaries for json or text index columns as they are
high-cardinality values almost always
- if ((fieldIndexConfigs.getConfig(StandardIndexes.json()).isEnabled()
- || fieldIndexConfigs.getConfig(StandardIndexes.text()).isEnabled()))
{
- return false;
- }
-
- // Do not create dictionary if index size with dictionary is going to be
larger than index size without dictionary
- // This is done to reduce the cost of dictionary for high cardinality
columns
- // Off by default and needs optimizeDictionary to be set to true
- if (spec.isSingleValueField() && spec.getDataType().isFixedWidth()) {
- return shouldCreateDictionaryWithinThreshold(info, config, spec);
- }
- }
-
- if (config.isOptimizeDictionaryForMetrics() &&
!config.isOptimizeDictionary()) {
- if (spec.isSingleValueField() && spec.getDataType().isFixedWidth() &&
spec.getFieldType() == FieldType.METRIC) {
- return shouldCreateDictionaryWithinThreshold(info, config, spec);
- }
- }
-
- return info.isCreateDictionary();
- }
-
- private boolean
shouldCreateDictionaryWithinThreshold(ColumnIndexCreationInfo info,
- SegmentGeneratorConfig config, FieldSpec spec) {
- long dictionarySize = info.getDistinctValueCount() * (long)
spec.getDataType().size();
- long forwardIndexSize =
- ((long) info.getTotalNumberOfEntries()
- * PinotDataBitSet.getNumBitsPerValue(info.getDistinctValueCount()
- 1) + Byte.SIZE - 1) / Byte.SIZE;
-
- double indexWithDictSize = dictionarySize + forwardIndexSize;
- double indexWithoutDictSize = info.getTotalNumberOfEntries() *
spec.getDataType().size();
-
- double indexSizeRatio = indexWithoutDictSize / indexWithDictSize;
- if (indexSizeRatio <= config.getNoDictionarySizeRatioThreshold()) {
+ FieldIndexConfigs fieldIndexConfigs =
config.getIndexConfigsByColName().get(column);
Review Comment:
nit: for easier readability, how about you set a variable defaulted to false
and return that after the if condition?
e.g.
```
boolean shouldCreateDictionary = false;
if
(DictionaryIndexType.ignoreDictionaryOverride(config.isOptimizeDictionary(),
config.isOptimizeDictionaryForMetrics(),
config.getNoDictionarySizeRatioThreshold(), spec,
fieldIndexConfigs, info.getDistinctValueCount(),
info.getTotalNumberOfEntries())) {
// Ignore overrides and pick from config
shouldCreateDictionary = info.isCreateDictionary();
}
return shouldCreateDictionary;
```
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/dictionary/DictionaryIndexType.java:
##########
@@ -191,6 +192,66 @@ public static boolean
shouldUseVarLengthDictionary(FieldSpec.DataType columnStor
return false;
}
+ /**
+ * This function evaluates whether to override dictionary (i.e use
noDictionary)
+ * for a column even when its explicitly configured. This evaluation is for
both dimension and metric
+ * column types.
+ * @return
Review Comment:
nit: remove this empty @return
##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandlerTest.java:
##########
@@ -2239,6 +2243,47 @@ public void
testAddOtherIndexWhenForwardIndexDisabledAndInvertedIndexOrDictionar
}
}
+ /**
+ * Tests to verify various combinations of inputs to test dictionary
override optimization.
+ */
+ @Test
+ public void testDictionaryOverride() {
+ FieldSpec fieldSpec = new MetricFieldSpec();
+ fieldSpec.setName("test");
+ fieldSpec.setDataType(FieldSpec.DataType.DOUBLE);
+ IndexType index1 = Mockito.mock(IndexType.class);
+ Mockito.when(index1.getId()).thenReturn("index1");
+ IndexConfig indexConf = new IndexConfig(true);
+ FieldIndexConfigs fieldIndexConfigs = new FieldIndexConfigs.Builder()
+ .add(index1, indexConf)
+ .build();
+ // No need to disable dictionary
+ boolean result = DictionaryIndexType.ignoreDictionaryOverride(false, true,
+ 2, fieldSpec,
+ fieldIndexConfigs, 5, 20);
+ Assert.assertEquals(result, true);
+
+ // Set a higher noDictionarySizeRatioThreshold
+ result = DictionaryIndexType.ignoreDictionaryOverride(false, true,
+ 5, fieldSpec,
+ fieldIndexConfigs, 5, 20);
+ Assert.assertEquals(result, false);
+
+ // optimizeDictionary and optimizeDictionaryForMetrics both turned on
+ result = DictionaryIndexType.ignoreDictionaryOverride(true, true,
+ 5, fieldSpec,
+ fieldIndexConfigs, 5, 20);
+ Assert.assertEquals(result, false);
+
+ // Don't ignore for Json. We want to disable dictionary for json.
+ fieldSpec = new DimensionFieldSpec();
+ fieldSpec.setName("test");
+ fieldSpec.setDataType(FieldSpec.DataType.JSON);
+ result = DictionaryIndexType.ignoreDictionaryOverride(true, true,
+ 5, fieldSpec,
+ fieldIndexConfigs, 5, 20);
+ Assert.assertEquals(result, true);
+ }
Review Comment:
nit: please add an empty space after this line
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandler.java:
##########
@@ -309,8 +309,16 @@ Map<String, List<Operation>>
computeOperations(SegmentDirectory.Reader segmentRe
LOGGER.warn("Cannot enable dictionary for column={} as schema or
tableConfig is null.", column);
continue;
}
-
- columnOperationsMap.put(column,
Collections.singletonList(Operation.ENABLE_DICTIONARY));
+ ColumnMetadata existingColumnMetadata =
_segmentDirectory.getSegmentMetadata().getColumnMetadataFor(column);
Review Comment:
Looks like even for `forwardIndexDisabled`, the optimize configs can be set.
Can you either add this check for the `forwardIndexDisabled` code paths as
well, or add some code to the
`TableConfigUtils::validateForwardIndexDisabledIndexCompatibility()` to
disallow setting the optimizer configs when the forward index is disabled. I
think the latter may make more sense since if the forward index is disabled
then there is no point in applying this optimization and there are other
complications with not having a dictionary.
I'd make this decision based on whether the main reason for these optimize
configs is to keep the forward index small or whether to avoid the dictionary
creation. I believe it's for the former, but confirm with @KKcorps as well.
Please add tests for the scenario as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]