subject:"\[GitHub\] \[incubator\-pinot\] siddharthteotia commented on a change in pull request #5470\: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled."

[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

2020-05-30 Thread GitBox



siddharthteotia commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432898453



##
File path: 
pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##
@@ -213,6 +215,14 @@ public void init(SegmentGeneratorConfig 
segmentCreationSpec, SegmentIndexCreatio
 }
   }
 
+  public static boolean shouldDeriveNumDocsPerChunk(String columnName, 
Map> columnProperties) {
+if (columnProperties != null) {
+  Map properties = columnProperties.get(columnName);
+  return properties != null && 
Boolean.parseBoolean(properties.get(FieldConfig.DERIVE_NUM_DOCS_PER_CHUNK_RAW_INDEX_KEY));

Review comment:
   > Defaults to false, right?
   
   Yes
   
   > Can we derive it automatically? (e.g. if column is text index then we 
derive it from metadata) Or, do you see this being usefiul in other cases as 
well?
   
   We could. Even for columns with text indexes, I don't think we should use it 
by default (since now that we have seen the potential -ve impact related to 
access pattern). Yes, most likely this will be used for columns with text index 
but only if the average column value size is very large (around 1-2MB) since 
that is the case which takes the chunk size and compressed chunk size (2 * raw) 
> 2GB and deriving the numDocsPerChunk becomes useful. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

2020-05-30 Thread GitBox



siddharthteotia commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432898231



##
File path: 
pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
*/
   private boolean isNoDictionaryColumn(Set noDictionaryColumns, 
Set invertedIndexColumns,
   Set textIndexColumns, FieldSpec fieldSpec, String column) {
-return textIndexColumns.contains(column) || 
(noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-&& !invertedIndexColumns.contains(column));
+if (textIndexColumns.contains(column)) {

Review comment:
   Actually it is not needed anymore. I already do the validation upfront 
in TableConfig (already in master). So we can remove it. Per column config is 
enough





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

2020-05-30 Thread GitBox



siddharthteotia commented on a change in pull request #5470:
URL: https://github.com/apache/incubator-pinot/pull/5470#discussion_r432896416



##
File path: 
pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
*/
   private boolean isNoDictionaryColumn(Set noDictionaryColumns, 
Set invertedIndexColumns,
   Set textIndexColumns, FieldSpec fieldSpec, String column) {
-return textIndexColumns.contains(column) || 
(noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-&& !invertedIndexColumns.contains(column));
+if (textIndexColumns.contains(column)) {
+  // text column is no dictionary currently
+  return true;
+}
+FieldSpec.DataType dataType = fieldSpec.getDataType();
+if (noDictionaryColumns.contains(column)) {
+  // Earlier we didn't support noDict in consuming segments for STRING and 
BYTES columns.
+  // So even if the user had the column in noDictionaryColumns set in 
table config, we still
+  // created dictionary in consuming segments.
+  // Later on we added this support. There is a particular impact of this 
change on the use cases
+  // that have set noDict on their STRING dimension columns for other 
performance
+  // reasons and also want metricsAggregation. These use cases don't get to
+  // aggregateMetrics because the new implementation is able to honor 
their table config setting
+  // of noDict on STRING/BYTES. Without metrics aggregation, memory 
pressure increases.
+  // So to continue aggregating metrics for such cases, we will create 
dictionary even
+  // if the column is part of noDictionary set from table config
+  if (fieldSpec instanceof DimensionFieldSpec && _aggregateMetrics && 
(dataType == FieldSpec.DataType.STRING ||

Review comment:
   Yes we checked for STRING. See here 
https://github.com/apache/incubator-pinot/pull/4791
   
   Note that there is a method  `enableMetricsAggregationIfPossible` that 
decides whether metrics can be aggregated or not and that checks whether all 
dimensions have dictionary, all metrics should not have dictionary and should 
be SV etc. That method is still intact.
   
   Just that during initialization of MutableSegmentImpl, we used to check for 
STRING and remove it from noDictionaryColumns set since raw index wasn't 
supported. This was actually the reason why the use cases were able to specify 
it as noDict in config and still able to aggregate metrics. 

##
File path: 
pinot-core/src/main/java/org/apache/pinot/core/indexsegment/mutable/MutableSegmentImpl.java
##
@@ -330,8 +330,32 @@ public long getLatestIngestionTimestamp() {
*/
   private boolean isNoDictionaryColumn(Set noDictionaryColumns, 
Set invertedIndexColumns,
   Set textIndexColumns, FieldSpec fieldSpec, String column) {
-return textIndexColumns.contains(column) || 
(noDictionaryColumns.contains(column) && fieldSpec.isSingleValueField()
-&& !invertedIndexColumns.contains(column));
+if (textIndexColumns.contains(column)) {

Review comment:
   done

##
File path: 
pinot-core/src/main/java/org/apache/pinot/core/segment/creator/impl/SegmentColumnarIndexCreator.java
##
@@ -193,9 +194,10 @@ public void init(SegmentGeneratorConfig 
segmentCreationSpec, SegmentIndexCreatio
 getColumnCompressionType(segmentCreationSpec, fieldSpec);
 
 // Initialize forward index creator
+boolean deriveNumChunksForVarByteRawIndex = 
shouldDeriveNumChunksForRawIndex(columnName, 
segmentCreationSpec.getColumnProperties());

Review comment:
   done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #5470: Derive numDocsPerChunk for var byte raw index from metadata only if config is enabled.

3 matches

Site Navigation

Mail list logo

Footer information