clintropolis commented on code in PR #12277:
URL: https://github.com/apache/druid/pull/12277#discussion_r1002613660
##########
processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnSupplier.java:
##########
@@ -74,7 +79,32 @@ public NestedDataColumnSupplier(
mapper,
NestedDataColumnSerializer.STRING_DICTIONARY_FILE_NAME
);
- dictionary = GenericIndexed.read(stringDictionaryBuffer, GenericIndexed.STRING_STRATEGY, mapper);
+
+ final int dictionaryStartPosition = stringDictionaryBuffer.position();
+ final byte dictionaryVersion = stringDictionaryBuffer.get();
+
+ if (dictionaryVersion == EncodedStringDictionaryWriter.VERSION) {
+ final byte encodingId = stringDictionaryBuffer.get();
+ if (encodingId == StringEncodingStrategy.FRONT_CODED_ID) {
+ frontCodedDictionary = FrontCodedIndexed.read(stringDictionaryBuffer, metadata.getByteOrder());
+ dictionary = null;
+ } else if (encodingId == StringEncodingStrategy.UTF8_ID) {
+ // this cannot happen naturally right now since generic indexed is written in the 'legacy' format, but
+ // this provides backwards compatibility should we switch at some point in the future to always
+ // writing dictionaryVersion
+ dictionary = GenericIndexed.read(stringDictionaryBuffer, GenericIndexed.BYTE_BUFFER_STRATEGY, mapper);
+ frontCodedDictionary = null;
Review Comment:
Ah, so I decided to just make it a `Supplier` for users of
`FrontCodedIndexed` so that I don't have to do what nearly all of the actual
users of `GenericIndexed` are doing, which is to call `singleThreaded()` on
the top-level column `GenericIndexed` dictionary to get an optimized,
non-thread-safe version for use in a single thread so that it creates less
garbage. By using a supplier I can avoid all this, since every thread just
gets its own copy (which is basically what all callers are doing anyway).
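
To make the supplier idea concrete, here is a minimal, self-contained sketch of the pattern. It is not Druid code: `DictionarySupplierSketch`, `ScratchReader`, and `supplierFor` are hypothetical names, it uses `java.util.function.Supplier` purely to stay dependency-free, and the flat byte layout is an assumption for illustration rather than the front-coded format. It only shows that each call to the supplier hands out an independent, non-thread-safe reader over the same shared bytes, so no caller needs a separate `singleThreaded()`-style step.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

public class DictionarySupplierSketch {

  /** Hypothetical reader: NOT thread-safe, because get() moves the position of its private view. */
  static final class ScratchReader {
    private final ByteBuffer view;
    private final int[] offsets;

    ScratchReader(ByteBuffer shared, int[] offsets) {
      // Same underlying bytes, but an independent position and limit for this instance.
      this.view = shared.asReadOnlyBuffer();
      this.offsets = offsets;
    }

    int size() {
      return offsets.length - 1;
    }

    String get(int index) {
      byte[] out = new byte[offsets[index + 1] - offsets[index]];
      view.position(offsets[index]); // mutable state: the reason an instance must stay on one thread
      view.get(out);
      return new String(out, StandardCharsets.UTF_8);
    }
  }

  /** Packs the values into one buffer and returns a supplier of per-caller readers over it. */
  static Supplier<ScratchReader> supplierFor(String... values) {
    byte[][] encoded = new byte[values.length][];
    int[] offsets = new int[values.length + 1];
    for (int i = 0; i < values.length; i++) {
      encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
      offsets[i + 1] = offsets[i] + encoded[i].length;
    }
    ByteBuffer buffer = ByteBuffer.allocate(offsets[values.length]);
    for (byte[] bytes : encoded) {
      buffer.put(bytes);
    }
    buffer.flip();
    // The shared buffer and offsets are never mutated after this point.
    final ByteBuffer shared = buffer.asReadOnlyBuffer();
    return () -> new ScratchReader(shared, offsets);
  }

  public static void main(String[] args) {
    Supplier<ScratchReader> supplier = supplierFor("druid", "nested", "column");
    // Each thread (or caller) asks the supplier for its own reader.
    ScratchReader reader = supplier.get();
    for (int i = 0; i < reader.size(); i++) {
      System.out.println(reader.get(i));
    }
  }
}
```

The trade-off in this sketch is a small allocation per `get()` on the supplier in exchange for never sharing mutable reader state between threads.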
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]