[GitHub] [druid] imply-cheddar commented on a diff in pull request #13732: various nested column (and other) fixes

via GitHub Thu, 02 Feb 2023 21:33:00 -0800


imply-cheddar commented on code in PR #13732:
URL: https://github.com/apache/druid/pull/13732#discussion_r1095351855



##########
processing/src/main/java/org/apache/druid/segment/vector/NilVectorSelector.java:
##########
@@ -82,9 +81,11 @@ public static NilVectorSelector create(final 
VectorSizeInspector vectorSizeInspe
           DEFAULT_OBJECT_VECTOR
       );
     } else {
+      final boolean[] nulls = new 
boolean[vectorSizeInspector.getMaxVectorSize()];
+      Arrays.fill(nulls, NullHandling.sqlCompatible());

Review Comment:
   The default initialization for a native `boolean` is false, so we should 
really only do this if `NullHandling.sqlCompatible()` is true.



##########
processing/src/main/java/org/apache/druid/segment/virtual/NestedFieldVirtualColumn.java:
##########
@@ -233,20 +245,49 @@ public DimensionSelector makeDimensionSelector(
       ReadableOffset offset
   )
   {
-    final NestedDataComplexColumn column = 
NestedDataComplexColumn.fromColumnSelector(columnSelector, columnName);
-    if (column == null) {
-      // complex column itself didn't exist
-      return DimensionSelector.constant(null);
+    ColumnHolder holder = columnSelector.getColumnHolder(columnName);
+    if (holder == null) {
+      // column doesn't exist
+      return dimensionSpec.decorate(DimensionSelector.constant(null, 
dimensionSpec.getExtractionFn()));
     }
     if (hasNegativeArrayIndex) {
-      return new FieldDimensionSelector(
-          new RawFieldLiteralColumnValueSelector(
-              column.makeColumnValueSelector(offset),
-              parts
-          )
-      );
+      // if the path has negative array elements, then we have to use the 
'raw' processing of the FieldDimensionSelector
+      // created with the column selector factory instead of using the 
optimized nested field column, return null
+      // to fall through
+      return null;
     }
-    return column.makeDimensionSelector(parts, offset, 
dimensionSpec.getExtractionFn());
+
+    return dimensionSpec.decorate(makeDimensionSelectorUndecorated(holder, 
offset, dimensionSpec.getExtractionFn()));
+  }
+
+  private DimensionSelector makeDimensionSelectorUndecorated(
+      ColumnHolder holder,
+      ReadableOffset offset,
+      @Nullable ExtractionFn extractionFn
+  )
+  {
+    BaseColumn theColumn = holder.getColumn();
+    if (theColumn instanceof NestedDataComplexColumn) {
+      final NestedDataComplexColumn column = (NestedDataComplexColumn) 
theColumn;
+      return column.makeDimensionSelector(parts, offset, extractionFn);
+    }
+
+    // ok, so not a nested column, but we can still do stuff
+    if (!parts.isEmpty()) {
+      // we are being asked for a path that will never exist, so we are null 
selector
+      return DimensionSelector.constant(null, extractionFn);
+    }

Review Comment:
   Is this because `parts.isEmpty()` means that we are looking for the root?  
And if we are not looking for the root then it's impossible to have something?  
If so, can we move the stuff that's after this if to be inside of the if (to 
eliminate the negation) and adjust the comments to explain the relationship 
between `parts.isEmpty()` and the "real world/query"?



##########
processing/src/test/resources/simple-nested-test-data.json:
##########
@@ -1,7 +1,7 @@
 {"timestamp": "2021-01-01", "dim": "hello", "nest": {"x": 100, "y": 200, "z": 
300}, "nester":{ "x": ["a", "b", "c"], "y": {"a": "a", "b": "b", "c": [1, 2, 
3]}}, "variant": {"a": ["hello", "world"], "b": {"x": "hello", "y": "world"}}, 
"list":[{"x": 5, "y": 10}, {"x": 15, "y": 22}]}
 {"timestamp": "2021-01-01", "dim": "hello", "nester":{ "x": ["x", "y", "z"]}, 
"list":[{"x": 35, "y": 310}, {"x": 315, "y": 322}]}
 {"timestamp": "2021-01-01", "dim": "hello", "nest":{ "x": 300, "y":  800}, 
"nester": "hello"}
-{"timestamp": "2021-01-01", "dim": "hello", "nest":{ "y": 500}, "list":[{"x": 
115, "y": 410}, {"x": 415, "y": 422}]}
+{"timestamp": "2021-01-01", "dim": "100", "nest":{ "y": 500}, "list":[{"x": 
115, "y": 410}, {"x": 415, "y": 422}]}

Review Comment:
   I noticed that it's still a string.  Was this change also wanting to test 
adding a new type and having it be a number?  Or is this being used to validate 
that the parsing of numbers works?



##########
processing/src/main/java/org/apache/druid/segment/virtual/NestedFieldVirtualColumn.java:
##########
@@ -233,20 +245,49 @@
       ReadableOffset offset
   )
   {
-    final NestedDataComplexColumn column = 
NestedDataComplexColumn.fromColumnSelector(columnSelector, columnName);
-    if (column == null) {
-      // complex column itself didn't exist
-      return DimensionSelector.constant(null);
+    ColumnHolder holder = columnSelector.getColumnHolder(columnName);
+    if (holder == null) {
+      // column doesn't exist
+      return dimensionSpec.decorate(DimensionSelector.constant(null));
     }
     if (hasNegativeArrayIndex) {
-      return new FieldDimensionSelector(
-          new RawFieldLiteralColumnValueSelector(
-              column.makeColumnValueSelector(offset),
-              parts
-          )
-      );
+      // if the path has negative array elements, then we have to use the 
'raw' processing of the FieldDimensionSelector
+      // created with the column selector factory instead of using the 
optimized nested field column, return null
+      // to fall through
+      return null;
+    }
+
+    return dimensionSpec.decorate(makeDimensionSelectorUndecorated(holder, 
offset, dimensionSpec.getExtractionFn()));

Review Comment:
   True statement, should be ignored in this case.



##########
sql/src/test/java/org/apache/druid/sql/calcite/CalciteNestedDataQueryTest.java:
##########
@@ -174,7 +198,72 @@
                     .rows(ROWS)
                     .buildMMappedIndex();
 
-    return new SpecificSegmentsQuerySegmentWalker(conglomerate).add(
+    final QueryableIndex indexMix11 =
+        IndexBuilder.create()
+                    .tmpDir(temporaryFolder.newFolder())
+                    
.segmentWriteOutMediumFactory(OffHeapMemorySegmentWriteOutMediumFactory.instance())
+                    .schema(
+                        new IncrementalIndexSchema.Builder()
+                            .withMetrics(
+                                new CountAggregatorFactory("cnt")
+                            )
+                            .withDimensionsSpec(PARSER)

Review Comment:
   Maybe make this call with `PARSER.getParseSpec().getDimensionSpec()` instead 
and avoid usage of the deprecated method?



##########
processing/src/main/java/org/apache/druid/segment/virtual/NestedFieldVirtualColumn.java:
##########
@@ -233,20 +245,49 @@ public DimensionSelector makeDimensionSelector(
       ReadableOffset offset
   )
   {
-    final NestedDataComplexColumn column = 
NestedDataComplexColumn.fromColumnSelector(columnSelector, columnName);
-    if (column == null) {
-      // complex column itself didn't exist
-      return DimensionSelector.constant(null);
+    ColumnHolder holder = columnSelector.getColumnHolder(columnName);
+    if (holder == null) {
+      // column doesn't exist
+      return dimensionSpec.decorate(DimensionSelector.constant(null, 
dimensionSpec.getExtractionFn()));
     }
     if (hasNegativeArrayIndex) {
-      return new FieldDimensionSelector(
-          new RawFieldLiteralColumnValueSelector(
-              column.makeColumnValueSelector(offset),
-              parts
-          )
-      );
+      // if the path has negative array elements, then we have to use the 
'raw' processing of the FieldDimensionSelector
+      // created with the column selector factory instead of using the 
optimized nested field column, return null
+      // to fall through
+      return null;
     }
-    return column.makeDimensionSelector(parts, offset, 
dimensionSpec.getExtractionFn());
+
+    return dimensionSpec.decorate(makeDimensionSelectorUndecorated(holder, 
offset, dimensionSpec.getExtractionFn()));
+  }
+
+  private DimensionSelector makeDimensionSelectorUndecorated(
+      ColumnHolder holder,
+      ReadableOffset offset,
+      @Nullable ExtractionFn extractionFn
+  )
+  {
+    BaseColumn theColumn = holder.getColumn();
+    if (theColumn instanceof NestedDataComplexColumn) {
+      final NestedDataComplexColumn column = (NestedDataComplexColumn) 
theColumn;
+      return column.makeDimensionSelector(parts, offset, extractionFn);
+    }
+
+    // ok, so not a nested column, but we can still do stuff
+    if (!parts.isEmpty()) {
+      // we are being asked for a path that will never exist, so we are null 
selector
+      return DimensionSelector.constant(null, extractionFn);
+    }
+
+    // the path was the 'root', we're in luck, spit out a selector that 
behaves the same way as a nested column

Review Comment:
   why does the `instanceof DictionaryEncoded` check tell us that the path was 
the root?  Or is that part intended to be explaining the if before this?



##########
processing/src/main/java/org/apache/druid/segment/virtual/NestedFieldVirtualColumn.java:
##########
@@ -258,18 +299,37 @@ public ColumnValueSelector<?> makeColumnValueSelector(
       ReadableOffset offset
   )
   {
-    final NestedDataComplexColumn column = 
NestedDataComplexColumn.fromColumnSelector(columnSelector, this.columnName);
-    if (column == null) {
+    ColumnHolder holder = columnSelector.getColumnHolder(this.columnName);
+    if (holder == null) {
       return NilColumnValueSelector.instance();
     }
+    BaseColumn theColumn = holder.getColumn();
 
-    // processFromRaw is true, that means JSON_QUERY, which can return partial 
results, otherwise this virtual column
-    // is JSON_VALUE which only returns literals, so we can use the nested 
columns value selector
-    return processFromRaw
-           ? new 
RawFieldColumnSelector(column.makeColumnValueSelector(offset), parts)
-           : hasNegativeArrayIndex
-             ? new 
RawFieldLiteralColumnValueSelector(column.makeColumnValueSelector(offset), 
parts)
-             : column.makeColumnValueSelector(parts, offset);
+    if (processFromRaw || hasNegativeArrayIndex) {
+      // if the path has negative array elements, or has set the flag to 
process 'raw' values explicitly (JSON_QUERY),
+      // then we use the 'raw' processing of the 
RawFieldColumnSelector/RawFieldLiteralColumnValueSelector created
+      // with the column selector factory instead of using the optimized 
nested field column
+      return null;
+    }

Review Comment:
   I understand why processFromRaw takes us into this if, I don't understand 
why `hasNegativeArrayIndex` does?



##########
processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnSupplier.java:
##########
@@ -168,22 +173,85 @@ public NestedDataColumnSupplier(
         );
         if (metadata.hasNulls()) {
           columnBuilder.setHasNulls(true);
-          final ByteBuffer nullIndexBuffer = loadInternalFile(mapper, 
NestedDataColumnSerializer.NULL_BITMAP_FILE_NAME);
+          final ByteBuffer nullIndexBuffer = loadInternalFile(
+              mapper,
+              metadata,
+              NestedDataColumnSerializer.NULL_BITMAP_FILE_NAME
+          );
           nullValues = 
metadata.getBitmapSerdeFactory().getObjectStrategy().fromByteBufferWithSize(nullIndexBuffer);
         } else {
           nullValues = 
metadata.getBitmapSerdeFactory().getBitmapFactory().makeEmptyImmutableBitmap();
         }
+
+        return new NestedDataColumnSupplier(
+            version,
+            metadata,
+            fields,
+            fieldInfo,
+            compressedRawColumnSupplier,
+            nullValues,
+            stringDictionary,
+            frontCodedStringDictionarySupplier,
+            longDictionarySupplier,
+            doubleDictionarySupplier,
+            columnConfig,
+            mapper,
+            simpleType
+        );

Review Comment:
   When building this, it appears like we are loading various files to then use 
in the suppliers.  Given the recent sensitivity around the amount of data held 
on JVM heap per segment, I'm curious if we are being as judicious as possible 
with our JVM memory.  We have to draw a fine line between performance and not 
performing hard-work multiple times and JVM heap, 'cause we can't fit 
everything in it.
   
   With the changes here, I would have to dig in further to really get a sense 
for how it is changing the JVM heap memory usage, so I'm just wondering what 
your thought process was around these changes and their impact to heap memory?



##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -798,7 +798,20 @@ DatasourceTable.PhysicalDatasourceMetadata 
buildDruidTable(final String dataSour
                 rowSignature.getColumnType(column)
                             .orElseThrow(() -> new ISE("Encountered null type 
for column [%s]", column));
 
-            columnTypes.putIfAbsent(column, columnType);
+            columnTypes.compute(column, (c, existingType) -> {
+              if (existingType == null) {
+                return columnType;
+              }
+              if (columnType == null) {
+                return existingType;
+              }
+              // if any are json, are all json
+              if (NestedDataComplexTypeSerde.TYPE.equals(columnType) || 
NestedDataComplexTypeSerde.TYPE.equals(existingType)) {
+                return NestedDataComplexTypeSerde.TYPE;
+              }
+              // "existing type" is the 'newest' type, since we iterate the 
segments list by newest start time
+              return existingType;
+            });

Review Comment:
   This is a question not actually about your code, but the code that existed 
before it...  Given that this is using a compute to put values into the map, 
how do these things change?



##########
processing/src/main/java/org/apache/druid/segment/virtual/NestedFieldVirtualColumn.java:
##########
@@ -233,20 +245,49 @@
       ReadableOffset offset
   )
   {
-    final NestedDataComplexColumn column = 
NestedDataComplexColumn.fromColumnSelector(columnSelector, columnName);
-    if (column == null) {
-      // complex column itself didn't exist
-      return DimensionSelector.constant(null);
+    ColumnHolder holder = columnSelector.getColumnHolder(columnName);
+    if (holder == null) {
+      // column doesn't exist
+      return dimensionSpec.decorate(DimensionSelector.constant(null, 
dimensionSpec.getExtractionFn()));

Review Comment:
   True statement, should be ignored in this case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [druid] imply-cheddar commented on a diff in pull request #13732: various nested column (and other) fixes

Reply via email to