clintropolis commented on code in PR #18589:
URL: https://github.com/apache/druid/pull/18589#discussion_r2396382796
##########
processing/src/main/java/org/apache/druid/segment/column/ColumnConfig.java:
##########
@@ -74,4 +82,9 @@ default double skipValueRangeIndexScale()
   {
     return DEFAULT_SKIP_VALUE_RANGE_INDEX_SCALE;
   }
+
+  default boolean deriveJsonColumnFromIndexes()
Review Comment:
in my opinion i don't think we should add this, since it's a system-level
control over what is a segment-level thing. If we do leave it, we should add
javadoc to indicate that this is a developer/testing control knob, not
something an operator would ever want to set (in case anyone stumbles on it in
the code, or a future developer wonders why it doesn't have docs).
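If the knob does stay, the javadoc being asked for might look something like this (a hypothetical sketch; the interface name and wording are illustrative, not the actual Druid `ColumnConfig` source):

```java
// Hypothetical sketch of the requested javadoc; the interface name and the
// wording are illustrative, not the actual Druid ColumnConfig source.
interface ColumnConfigSketch
{
  /**
   * Developer/testing control knob only. Toggles whether nested (JSON) columns
   * reconstruct objects from their field indexes instead of reading the raw
   * blob column. This is a system-level override of a segment-level property
   * and is not intended to be set by operators, which is why it is deliberately
   * absent from the user-facing docs.
   */
  default boolean deriveJsonColumnFromIndexes()
  {
    return false;
  }
}
```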
##########
processing/src/main/java/org/apache/druid/segment/data/CompressedVariableSizedBlobColumnSerializer.java:
##########
@@ -51,13 +54,15 @@ public class CompressedVariableSizedBlobColumnSerializer implements Serializer
public CompressedVariableSizedBlobColumnSerializer(
final String filenameBase,
final SegmentWriteOutMedium segmentWriteOutMedium,
+ final ObjectStorageEncoding objectStorageEncoding,
final CompressionStrategy compression
)
{
this.filenameBase = filenameBase;
this.offsetsFile = getCompressedOffsetsFileName(filenameBase);
this.blobsFile = getCompressedBlobsFileName(filenameBase);
this.segmentWriteOutMedium = segmentWriteOutMedium;
+ this.objectStorageEncoding = objectStorageEncoding;
Review Comment:
this should not be passed into `CompressedVariableSizedBlobColumnSerializer`,
since it is a JSON-column-specific thing. I think all of the changes to this
and `CompressedVariableSizedBlobColumnSupplier` need to be reverted, and the
behavior handled entirely within the nested data column layer, which should not
create this serializer (or store the internal 'raw' file that results from it)
at all. The `objectStorageEncoding` can be passed into the column supplier from
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/serde/NestedCommonFormatColumnPartSerde.java#L343,
which the read method can use to know whether or not to bother with the
`CompressedVariableSizedBlobColumnSupplier`.
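The suggested wiring could be sketched roughly like this, with the read path deciding up front whether a raw column exists at all; the enum values and class names here are stand-ins, not the actual Druid types:

```java
import java.util.function.Supplier;

// Illustrative sketch (not the actual Druid code): the nested column's part
// serde read method receives the object storage encoding and only builds the
// raw-blob supplier when a raw file was actually written.
class NestedColumnReadSketch
{
  // stand-in values; the real enum constants in the PR may differ
  enum ObjectStorageEncoding { SMILE, NONE }

  // returns null when the segment has no 'raw' file to read, so the caller
  // never touches CompressedVariableSizedBlobColumnSupplier in that case
  static Supplier<String> readRawColumnSupplier(ObjectStorageEncoding encoding)
  {
    if (encoding == ObjectStorageEncoding.NONE) {
      return null;
    }
    return () -> "raw-blob-column";
  }
}
```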
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -382,17 +418,32 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
   @Override
   public VectorObjectSelector makeVectorObjectSelector(ReadableVectorOffset offset)
   {
-    final TKeyDictionary fields = fieldsSupplier.get();
-    if (!logicalType.equals(ColumnType.NESTED_DATA) && fields.size() == 1 && rootFieldPath.equals(StringUtils.fromUtf8(fields.get(0)))) {
+    if (!logicalType.equals(ColumnType.NESTED_DATA)
+        && fieldPathMap.size() == 1
+        && rootFieldPath.equals(Iterables.getOnlyElement(fieldPathMap.values()).lhs)) {
       return makeVectorObjectSelector(
           Collections.emptyList(),
           null /* not used */,
           offset
       );
     }
-    if (compressedRawColumn == null) {
+
+    if (compressedRawColumn == null && !columnConfig.deriveJsonColumnFromIndexes()) {
       compressedRawColumn = closer.register(compressedRawColumnSupplier.get());
     }
+    AtomicInteger rowNumber = new AtomicInteger(-1);
+    AtomicIntegerReadableOffset atomicOffset = new AtomicIntegerReadableOffset(rowNumber);
+    final List<Pair<List<NestedPathPart>, ColumnValueSelector>> fieldSelectors =
+        compressedRawColumn != null ? null :
Review Comment:
maybe we should just make the `compressedRawColumnSupplier` nullable instead
of checking the thing from the supplier is null?
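A minimal sketch of the nullable-supplier idea, with hypothetical names: readers test the supplier field itself rather than calling `get()` and checking its result:

```java
import java.util.function.Supplier;

// Hypothetical sketch: the supplier field is simply null when the encoding
// wrote no raw column, so the presence check is on the field, not on the
// object the supplier would produce.
class NullableSupplierSketch
{
  private final Supplier<byte[]> rawColumnSupplier; // null when no raw column exists

  NullableSupplierSketch(Supplier<byte[]> rawColumnSupplier)
  {
    this.rawColumnSupplier = rawColumnSupplier;
  }

  boolean hasRawColumn()
  {
    // no need to materialize the column just to find out it is absent
    return rawColumnSupplier != null;
  }
}
```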
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -329,28 +330,54 @@ public Object getRowValue(int rowNum)
       return null;
     }
-    if (compressedRawColumn == null) {
+    if (compressedRawColumn == null && !columnConfig.deriveJsonColumnFromIndexes()) {
       compressedRawColumn = closer.register(compressedRawColumnSupplier.get());
     }
-    final ByteBuffer valueBuffer = compressedRawColumn.get(rowNum);
-    return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+    if (compressedRawColumn != null) {
+      final ByteBuffer valueBuffer = compressedRawColumn.get(rowNum);
+      return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+    }
+
+    ReadableOffset offset = new AtomicIntegerReadableOffset(new AtomicInteger(rowNum));
+    final List<StructuredDataBuilder.Element> elements =
+        fieldPathMap.keySet().stream()
+                    .map(path -> StructuredDataBuilder.Element.of(
+                        path,
+                        (Objects.requireNonNull(getColumnHolder(path)).getColumn()).makeColumnValueSelector(offset)
+                            .getObject()
+                    ))
+                    .collect(Collectors.toList());
Review Comment:
making new offset and column value selector for every row seems pretty
unchill... while I don't think much should be calling this method since we
implement `makeColumnValueSelector`, could you look further into whether or not
something will call this? If something does, it would be good to know to either
consider changing it or knowing if it would be ok. or if nothing is expected to
call this maybe we throw an exception
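If it turns out nothing legitimately calls this path, the fail-fast option floated here might look roughly like this (hypothetical names and message, not the actual Druid method):

```java
// Hypothetical sketch: rather than paying the per-row selector construction
// cost, fail fast when getRowValue is reached in the derived-from-indexes mode.
class RowValueSketch
{
  static Object getRowValue(boolean derivedFromIndexes, int rowNum)
  {
    if (derivedFromIndexes) {
      throw new UnsupportedOperationException(
          "getRowValue is not supported when the object is derived from field indexes; "
          + "use makeColumnValueSelector instead"
      );
    }
    // stand-in for decoding the blob from the raw column
    return "value-from-raw-column:" + rowNum;
  }
}
```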
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -361,8 +388,17 @@ public Object getObject()
if (nullValues.get(offset.getOffset())) {
return null;
}
- final ByteBuffer valueBuffer =
compressedRawColumn.get(offset.getOffset());
- return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+ if (compressedRawColumn != null) {
+ final ByteBuffer valueBuffer =
compressedRawColumn.get(offset.getOffset());
+ return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+ }
+ List<StructuredDataBuilder.Element> elements =
+ Objects.requireNonNull(fieldSelectors)
+ .stream()
+ .map(c -> StructuredDataBuilder.Element.of(c.lhs,
c.rhs.getObject()))
+ .collect(Collectors.toList());
+ return new StructuredDataBuilder(elements).build();
+
Review Comment:
this might be a bit cleaner to just make separate `ObjectColumnSelector`
implementations depending on object encoding (as we add more object encodings
in the future).
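The shape being suggested, sketched with stand-in names: choose the selector implementation once at construction time, so no per-row branching on `compressedRawColumn` is needed:

```java
import java.util.function.IntFunction;

// Illustrative sketch (not the actual Druid types): one selector
// implementation per object encoding, picked once when the selector is made.
class SelectorFactorySketch
{
  interface ObjectSelector
  {
    Object getObject(int row);
  }

  // raw-blob-backed selector: decodes the stored blob for the row
  static ObjectSelector rawSelector(IntFunction<Object> rawColumn)
  {
    return rawColumn::apply;
  }

  // index-derived selector: rebuilds the object from per-field selectors
  static ObjectSelector derivedSelector(IntFunction<Object> rebuildFromFields)
  {
    return rebuildFromFields::apply;
  }

  static ObjectSelector makeSelector(boolean hasRawColumn, IntFunction<Object> raw, IntFunction<Object> derived)
  {
    // branch once here, not inside every getObject call
    return hasRawColumn ? rawSelector(raw) : derivedSelector(derived);
  }
}
```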
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -434,8 +485,17 @@ private Object getForOffset(int offset)
         // maybe someday can use bitmap batch operations for nulls?
         return null;
       }
-      final ByteBuffer valueBuffer = compressedRawColumn.get(offset);
-      return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+      if (compressedRawColumn != null) {
+        final ByteBuffer valueBuffer = compressedRawColumn.get(offset);
+        return STRATEGY.fromByteBuffer(valueBuffer, valueBuffer.remaining());
+      } else {
+        rowNumber.set(offset);
+        List<StructuredDataBuilder.Element> elements = fieldSelectors
+            .stream()
+            .map(c -> StructuredDataBuilder.Element.of(c.lhs, c.rhs.getObject()))
+            .collect(Collectors.toList());
+        return new StructuredDataBuilder(elements).build();
+      }
Review Comment:
same as non-vectorized, i think this would be cleaner to have separate
`VectorObjectSelector` implementations returned depending on object encoding
rather than checking inside every vector
##########
processing/src/main/java/org/apache/druid/segment/nested/CompressedNestedDataComplexColumn.java:
##########
@@ -156,7 +160,12 @@ public CompressedNestedDataComplexColumn(
     this.columnName = columnName;
     this.logicalType = logicalType;
     this.nullValues = nullValues;
-    this.fieldsSupplier = fieldsSupplier;
+    final TKeyDictionary fields = fieldsSupplier.get();
+    this.fieldPathMap = CollectionUtils.newLinkedHashMapWithExpectedSize(fields.size());
+    for (int i = 0; i < fields.size(); i++) {
+      String field = StringUtils.fromUtf8(fields.get(i));
+      fieldPathMap.put(parsePath(field), Pair.of(field, i));
+    }
Review Comment:
i kind of think this should be lazy computed, really complicated json
objects can have thousands of paths, and we really only need to know stuff
about all of them if the query is doing something like json_keys/json_paths (or
this new thing of reconstructing the object from the list of paths). If a query
just has a json_value for a single field of the object, then it seems like it
would be penalized by having to build this map for a bunch of paths it doesn't
care about.
If changed to be lazy computed, the column selector methods for fields (and
any other methods that take a list of path parts like `getFieldLogicalType`)
should not use this map unless it was computed, since those are the things
called when we want to make a selector for a specific path directly by things
like NestedFieldVirtualColumn and other nested column specialized virtual
columns.
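The lazy computation being suggested can be sketched with a simple memoizing holder (illustrative names and types; the Druid code might instead use something like Guava's `Suppliers.memoize`):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch only: the path map is built at most once, on first use by
// json_keys/json_paths-style code; single-field selector paths never touch it
// and so never pay the cost of building it.
class LazyFieldPathMapSketch
{
  private Map<String, Integer> fieldPathMap; // built lazily on first access
  private final Supplier<Map<String, Integer>> builder;
  int buildCount = 0; // exposed so the laziness is observable in this sketch

  LazyFieldPathMapSketch(String[] fields)
  {
    this.builder = () -> {
      buildCount++;
      Map<String, Integer> map = new LinkedHashMap<>(fields.length);
      for (int i = 0; i < fields.length; i++) {
        map.put(fields[i], i);
      }
      return map;
    };
  }

  Map<String, Integer> getFieldPathMap()
  {
    if (fieldPathMap == null) {
      fieldPathMap = builder.get();
    }
    return fieldPathMap;
  }
}
```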
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]