[GitHub] [druid] clintropolis commented on a diff in pull request #12277: add support for 'front coded' string dictionaries for smaller string columns

GitBox Sat, 22 Oct 2022 18:44:04 -0700


clintropolis commented on code in PR #12277:
URL: https://github.com/apache/druid/pull/12277#discussion_r1002613660



##########
processing/src/main/java/org/apache/druid/segment/nested/NestedDataColumnSupplier.java:
##########
@@ -74,7 +79,32 @@ public NestedDataColumnSupplier(
             mapper,
             NestedDataColumnSerializer.STRING_DICTIONARY_FILE_NAME
         );
-        dictionary = GenericIndexed.read(stringDictionaryBuffer, GenericIndexed.STRING_STRATEGY, mapper);
+
+        final int dictionaryStartPosition = stringDictionaryBuffer.position();
+        final byte dictionaryVersion = stringDictionaryBuffer.get();
+
+        if (dictionaryVersion == EncodedStringDictionaryWriter.VERSION) {
+          final byte encodingId = stringDictionaryBuffer.get();
+          if (encodingId == StringEncodingStrategy.FRONT_CODED_ID) {
+            frontCodedDictionary = FrontCodedIndexed.read(stringDictionaryBuffer, metadata.getByteOrder());
+            dictionary = null;
+          } else if (encodingId == StringEncodingStrategy.UTF8_ID) {
+            // this cannot happen naturally right now since generic indexed is written in the 'legacy' format, but
+            // this provides backwards compatibility should we switch at some point in the future to always
+            // writing dictionaryVersion
+            dictionary = GenericIndexed.read(stringDictionaryBuffer, GenericIndexed.BYTE_BUFFER_STRATEGY, mapper);
+            frontCodedDictionary = null;

Review Comment:
   Ah, so I decided to just make it a `Supplier` for users of
`FrontCodedIndexed` so that I don't have to do what nearly all of the actual
users of `GenericIndexed` are doing, which is to call `singleThreaded()` on the
top-level column `GenericIndexed` dictionary to get an optimized,
non-thread-safe version for use in a single thread, so that it creates less
garbage. By using a supplier I can avoid all this, since each thread just gets
its own copy (which is basically what all callers are doing anyway).
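
   To make the supplier idea concrete, here is a minimal sketch. It is not
the code in this PR: `TinyDictionary`, its fixed-width value layout, and its
`read` signature are all hypothetical, made up only to show the shape. The
assumption is that the shared bytes are never mutated, so every `get()` on the
supplier can hand out a fresh reader with its own `ByteBuffer` view, and no
`singleThreaded()`-style call is needed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

// Hypothetical stand-in for a dictionary reader; this is NOT Druid's FrontCodedIndexed.
// Values are assumed to be fixed-width UTF-8 strings to keep the sketch short.
final class TinyDictionary
{
  // Private to this instance, so moving its position cannot affect other readers.
  private final ByteBuffer buffer;
  private final int valueLength;

  private TinyDictionary(ByteBuffer buffer, int valueLength)
  {
    this.buffer = buffer;
    this.valueLength = valueLength;
  }

  // Returns a supplier instead of an instance: every get() wraps a new view of the same bytes.
  static Supplier<TinyDictionary> read(ByteBuffer shared, int valueLength)
  {
    // slice() makes index 0 line up with the caller's current position;
    // the base view is never repositioned after this point.
    final ByteBuffer base = shared.slice().asReadOnlyBuffer();
    return () -> new TinyDictionary(base.duplicate(), valueLength);
  }

  // Not thread-safe on purpose: it moves this instance's own buffer position.
  String get(int index)
  {
    final byte[] bytes = new byte[valueLength];
    buffer.position(index * valueLength);
    buffer.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args)
  {
    final ByteBuffer bytes = ByteBuffer.wrap("aaabbbccc".getBytes(StandardCharsets.UTF_8));
    final Supplier<TinyDictionary> supplier = TinyDictionary.read(bytes, 3);
    // Each caller (for example, each query thread) asks the supplier for its own copy.
    final TinyDictionary mine = supplier.get();
    System.out.println(mine.get(1));
  }
}
```

   The copies are cheap because `duplicate()` shares the underlying bytes and
only copies the position and limit, so each reader can be as unsynchronized as
it likes.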


