clintropolis commented on code in PR #18589:
URL: https://github.com/apache/druid/pull/18589#discussion_r2396382796
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -74,4 +82,9 @@ default double skipValueRangeIndexScale()
   {
     return DEFAULT_SKIP_VALUE_RANGE_INDEX_SCALE;
   }
+
+  default boolean deriveJsonColumnFromIndexes()
Review Comment:
in my opinion i don't think we should add this, since it's a system-level
control over what is a segment-level thing. If we do leave it, we should add
javadoc to indicate that this is a developer/testing control knob, not
something an operator would ever want to set (in case anyone stumbles on it in
the code, or a future developer wonders why it doesn't have docs).
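If the knob does stay, the javadoc being asked for might look something like this (a hypothetical sketch; the interface name and wording are illustrative, not the actual Druid `ColumnConfig` source):

```java
// Hypothetical sketch of the requested javadoc; the interface name and the
// wording are illustrative, not the actual Druid ColumnConfig source.
interface ColumnConfigSketch
{
  /**
   * Developer/testing control knob only. Toggles whether nested (JSON) columns
   * reconstruct objects from their field indexes instead of reading the raw
   * blob column. This is a system-level override of a segment-level property
   * and is not intended to be set by operators, which is why it is deliberately
   * absent from the user-facing docs.
   */
  default boolean deriveJsonColumnFromIndexes()
  {
    return false;
  }
}
```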
##########
processing/src/main/java/org/apache/druid/segment/data/CompressedVariableSizedBlobColumnSerializer.java:
##########
@@ -51,13 +54,15 @@ public class CompressedVariableSizedBlobColumnSerializer implements Serializer
public CompressedVariableSizedBlobColumnSerializer(
final String filenameBase,
final SegmentWriteOutMedium segmentWriteOutMedium,
+ final ObjectStorageEncoding objectStorageEncoding,
final CompressionStrategy compression
)
{
this.filenameBase = filenameBase;
this.offsetsFile = getCompressedOffsetsFileName(filenameBase);
this.blobsFile = getCompressedBlobsFileName(filenameBase);
this.segmentWriteOutMedium = segmentWriteOutMedium;
+ this.objectStorageEncoding = objectStorageEncoding;
Review Comment:
this should not be passed into `CompressedVariableSizedBlobColumnSerializer`,
since it is a JSON-column-specific thing. I think all of the changes to this
and `CompressedVariableSizedBlobColumnSupplier` need to be reverted, and the
behavior handled entirely within the nested data column layer, which should not
create this serializer (or store the internal 'raw' file that results from it)
at all. The `objectStorageEncoding` can be passed into the column supplier from
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/serde/NestedCommonFormatColumnPartSerde.java#L343,
which the read method can use to know whether or not to bother with the
`CompressedVariableSizedBlobColumnSupplier`.
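The suggested wiring could be sketched roughly like this, with the read path deciding up front whether a raw column exists at all; the enum values and class names here are stand-ins, not the actual Druid types:

```java
import java.util.function.Supplier;

// Illustrative sketch (not the actual Druid code): the nested column's part
// serde read method receives the object storage encoding and only builds the
// raw-blob supplier when a raw file was actually written.
class NestedColumnReadSketch
{
  // stand-in values; the real enum constants in the PR may differ
  enum ObjectStorageEncoding { SMILE, NONE }

  // returns null when the segment has no 'raw' file to read, so the caller
  // never touches CompressedVariableSizedBlobColumnSupplier in that case
  static Supplier<String> readRawColumnSupplier(ObjectStorageEncoding encoding)
  {
    if (encoding == ObjectStorageEncoding.NONE) {
      return null;
    }
    return () -> "raw-blob-column";
  }
}
```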
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -382,17 +418,32 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
   @Override
   public VectorObjectSelector makeVectorObjectSelector(ReadableVectorOffset offset)
   {
-    final TKeyDictionary fields = fieldsSupplier.get();
-    if (!logicalType.equals(ColumnType.NESTED_DATA) && fields.size() == 1 && rootFieldPath.equals(StringUtils.fromUtf8(fields.get(0)))) {
+    if (!logicalType.equals(ColumnType.NESTED_DATA)
+        && fieldPathMap.size() == 1
+        && rootFieldPath.equals(Iterables.getOnlyElement(fieldPathMap.values()).lhs)) {
       return makeVectorObjectSelector(
           Collections.emptyList(),
           null /* not used */,
           offset
       );
     }
-    if (compressedRawColumn == null) {
+
+    if (compressedRawColumn == null && !columnConfig.deriveJsonColumnFromIndexes()) {
       compressedRawColumn = closer.register(compressedRawColumnSupplier.get());
     }
+    AtomicInteger rowNumber = new AtomicInteger(-1);
+    AtomicIntegerReadableOffset atomicOffset = new AtomicIntegerReadableOffset(rowNumber);
+    final List<Pair<List<NestedPathPart>, ColumnValueSelector>> fieldSelectors =
+        compressedRawColumn != null ? null :
Review Comment:
maybe we should just make the `compressedRawColumnSupplier` nullable instead
of checking the thing from the supplier is null?
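A minimal sketch of the nullable-supplier idea, with hypothetical names: readers test the supplier field itself rather than calling `get()` and checking its result:

```java
import java.util.function.Supplier;

// Hypothetical sketch: the supplier field is simply null when the encoding
// wrote no raw column, so the presence check is on the field, not on the
// object the supplier would produce.
class NullableSupplierSketch
{
  private final Supplier<byte[]> rawColumnSupplier; // null when no raw column exists

  NullableSupplierSketch(Supplier<byte[]> rawColumnSupplier)
  {
    this.rawColumnSupplier = rawColumnSupplier;
  }

  boolean hasRawColumn()
  {
    // no need to materialize the column just to find out it is absent
    return rawColumnSupplier != null;
  }
}
```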
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -329,28 +330,54 @@ public Object getRowValue(int rowNum)
       return null;
     }
-    if (compressedRawColumn == null) {
+    if (compressedRawColumn == null && !columnConfig.deriveJsonColumnFromIndexes()) {
       compressedRawColumn = closer.register(compressedRawColumnSupplier.get());
     }
-    final ByteBuffer valueBuffer = compressedRawColumn.get(rowNum);
-    return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+    if (compressedRawColumn != null) {
+      final ByteBuffer valueBuffer = compressedRawColumn.get(rowNum);
+      return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+    }
+
+    ReadableOffset offset = new AtomicIntegerReadableOffset(new AtomicInteger(rowNum));
+    final List<StructuredDataBuilder.Element> elements =
+        fieldPathMap.keySet().stream()
+                    .map(path -> StructuredDataBuilder.Element.of(
+                        path,
+                        (Objects.requireNonNull(getColumnHolder(path)).getColumn()).makeColumnValueSelector(offset)
+                            .getObject()
+                    ))
+                    .collect(Collectors.toList());
Review Comment:
making new offset and column value selector for every row seems pretty
unchill... while I don't think much should be calling this method since we
implement `makeColumnValueSelector`, could you look further into whether or not
something will call this? If something does, it would be good to know to either
consider changing it or knowing if it would be ok. or if nothing is expected to
call this maybe we throw an exception
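If it turns out nothing legitimately calls this path, the fail-fast option floated here might look roughly like this (hypothetical names and message, not the actual Druid method):

```java
// Hypothetical sketch: rather than paying the per-row selector construction
// cost, fail fast when getRowValue is reached in the derived-from-indexes mode.
class RowValueSketch
{
  static Object getRowValue(boolean derivedFromIndexes, int rowNum)
  {
    if (derivedFromIndexes) {
      throw new UnsupportedOperationException(
          "getRowValue is not supported when the object is derived from field indexes; "
          + "use makeColumnValueSelector instead"
      );
    }
    // stand-in for decoding the blob from the raw column
    return "value-from-raw-column:" + rowNum;
  }
}
```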
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -361,8 +388,17 @@ public Object getObject()
if (nullValues.get(offset.getOffset())) {
return null;
}
- final ByteBuffer valueBuffer =
compressedRawColumn.get(offset.getOffset());
- return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+ if (compressedRawColumn != null) {
+ final ByteBuffer valueBuffer =
compressedRawColumn.get(offset.getOffset());
+ return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+ }
+ List<StructuredDataBuilder.Element> elements =
+ Objects.requireNonNull(fieldSelectors)
+ .stream()
+ .map(c -> StructuredDataBuilder.Element.of(c.lhs,
c.rhs.getObject()))
+ .collect(Collectors.toList());
+ return new StructuredDataBuilder(elements).build();
+
Review Comment:
this might be a bit cleaner to just make separate `ObjectColumnSelector`
implementations depending on object encoding (as we add more object encodings
in the future).
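The shape being suggested, sketched with stand-in names: choose the selector implementation once at construction time, so no per-row branching on `compressedRawColumn` is needed:

```java
import java.util.function.IntFunction;

// Illustrative sketch (not the actual Druid types): one selector
// implementation per object encoding, picked once when the selector is made.
class SelectorFactorySketch
{
  interface ObjectSelector
  {
    Object getObject(int row);
  }

  // raw-blob-backed selector: decodes the stored blob for the row
  static ObjectSelector rawSelector(IntFunction<Object> rawColumn)
  {
    return rawColumn::apply;
  }

  // index-derived selector: rebuilds the object from per-field selectors
  static ObjectSelector derivedSelector(IntFunction<Object> rebuildFromFields)
  {
    return rebuildFromFields::apply;
  }

  static ObjectSelector makeSelector(boolean hasRawColumn, IntFunction<Object> raw, IntFunction<Object> derived)
  {
    // branch once here, not inside every getObject call
    return hasRawColumn ? rawSelector(raw) : derivedSelector(derived);
  }
}
```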
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -434,8 +485,17 @@ private Object getForOffset(int offset)
         // maybe someday can use bitmap batch operations for nulls?
         return null;
       }
-      final ByteBuffer valueBuffer = compressedRawColumn.get(offset);
-      return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+      if (compressedRawColumn != null) {
+        final ByteBuffer valueBuffer = compressedRawColumn.get(offset);
+        return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+      } else {
+        rowNumber.set(offset);
+        List<StructuredDataBuilder.Element> elements = fieldSelectors
+            .stream()
+            .map(c -> StructuredDataBuilder.Element.of(c.lhs, c.rhs.getObject()))
+            .collect(Collectors.toList());
+        return new StructuredDataBuilder(elements).build();
+      }
Review Comment:
same as non-vectorized, i think this would be cleaner to have separate
`VectorObjectSelector` implementations returned depending on object encoding
rather than checking inside every vector
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -156,7 +160,12 @@ public CompressedNestedDataComplexColumn(
     this.columnName = columnName;
     this.logicalType = logicalType;
     this.nullValues = nullValues;
-    this.fieldsSupplier = fieldsSupplier;
+    final TKeyDictionary fields = fieldsSupplier.get();
+    this.fieldPathMap = CollectionUtils.newLinkedHashMapWithExpectedSize(fields.size());
+    for (int i = 0; i < fields.size(); i++) {
+      String field = StringUtils.fromUtf8(fields.get(i));
+      fieldPathMap.put(parsePath(field), Pair.of(field, i));
+    }
Review Comment:
i kind of think this should be lazy computed, really complicated json
objects can have thousands of paths, and we really only need to know stuff
about all of them if the query is doing something like json_keys/json_paths (or
this new thing of reconstructing the object from the list of paths). If a query
just has a json_value for a single field of the object, then it seems like it
would be penalized by having to build this map for a bunch of paths it doesn't
care about.
If changed to be lazy computed, the column selector methods for fields (and
any other methods that take a list of path parts like `getFieldLogicalType`)
should not use this map unless it was computed, since those are the things
called when we want to make a selector for a specific path directly by things
like NestedFieldVirtualColumn and other nested column specialized virtual
columns.
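The lazy computation being suggested can be sketched with a simple memoizing holder (illustrative names and types; the Druid code might instead use something like Guava's `Suppliers.memoize`):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch only: the path map is built at most once, on first use by
// json_keys/json_paths-style code; single-field selector paths never touch it
// and so never pay the cost of building it.
class LazyFieldPathMapSketch
{
  private Map<String, Integer> fieldPathMap; // built lazily on first access
  private final Supplier<Map<String, Integer>> builder;
  int buildCount = 0; // exposed so the laziness is observable in this sketch

  LazyFieldPathMapSketch(String[] fields)
  {
    this.builder = () -> {
      buildCount++;
      Map<String, Integer> map = new LinkedHashMap<>(fields.length);
      for (int i = 0; i < fields.length; i++) {
        map.put(fields[i], i);
      }
      return map;
    };
  }

  Map<String, Integer> getFieldPathMap()
  {
    if (fieldPathMap == null) {
      fieldPathMap = builder.get();
    }
    return fieldPathMap;
  }
}
```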
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]