Re: [PR] projection segment merge fixes (druid)

via GitHub Wed, 13 Nov 2024 12:01:34 -0800


clintropolis commented on code in PR #17460:
URL: https://github.com/apache/druid/pull/17460#discussion_r1841079053



##########
processing/src/main/java/org/apache/druid/common/utils/SerializerUtils.java:
##########
@@ -224,4 +239,83 @@ public int getSerializedStringByteSize(String str)
   {
     return Integer.BYTES + StringUtils.toUtf8(str).length;
   }
+
+
+  /**
+   * Write a {@link Serializer} to a {@link Path} and then map it as a 
read-only {@link MappedByteBuffer}
+   */
+  @SuppressFBWarnings("NP_NONNULL_PARAM_VIOLATION")

Review Comment:
   the answer to this is a bit complicated, yea some serializers do not require 
a `FileSmoosher` (basically everything that does not support "large" files 
greater than 2gb does not need a smoosher). Technically none of the current 
callers need a smoosher since the new caller is just taking an `IntBuffer` and 
persisting/mapping it, and the other callers are using `FixedIndexed` which 
does not support large files yet, but accepting a `Serializer` here is kind of 
risky for future misuse.
   
   I think I like two options here.
   
   One option is that `mapSerializer` just makes its own 
`FileSmoosher`/`SmooshedWriter` and just write the serializer into a temporary 
smoosh at the same name as the temp file name, and then just return the loaded 
smoosh instead of a mapped byte buffer directly. This is actually already what 
`DictionaryIdLookup` (where i refactored this code from) is doing for string 
dictionaries, 
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/nested/DictionaryIdLookup.java#L235
 and would let it work with any serializer implementation, and presumably the 
thing we want to read from the mapped file knows how to load other stuff from 
the smoosher.
   
   The other option that seems worth considering is as you said, instead of 
passing in `Serializer` just pass in some other interface that writes with just 
`WriteableByteChannel` and a length argument with the expectation that its only 
going to write to the channel and then map the byte buffer and it fits in the 
buffer. This solves the immediate use case I think, but is sort of limited as 
long as we are stuck with `ByteBuffer`.
   
   I think I'm sort of leaning towards the 'make a new smoosh' option since it 
seems the most flexible, its like "make a smoosh file that contains only the 
contents of this serializer", it does have a little extra overhead for things 
that do fit in a single buffer, but doesn't seem too bad. Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] projection segment merge fixes (druid)

Reply via email to