gianm commented on code in PR #17691:
URL: https://github.com/apache/druid/pull/17691#discussion_r1939794731
##########
processing/src/main/java/org/apache/druid/segment/data/V3CompressedVSizeColumnarMultiIntsSerializer.java:
##########
@@ -35,33 +35,47 @@ public class V3CompressedVSizeColumnarMultiIntsSerializer
     extends ColumnarMultiIntsSerializer
 {
private static final byte VERSION =
V3CompressedVSizeColumnarMultiIntsSupplier.VERSION;
+  /**
+   * Creates a new serializer.
+   *
+   * @param columnName            name of the column to write
+   * @param segmentWriteOutMedium supplier of temporary files
+   * @param filenameBase          base filename to be used for secondary files, if multiple files are needed
+   * @param maxValue              maximum integer value that will be written to the column
+   * @param compression           compression strategy to apply
+   * @param fileSizeLimit         limit for files created by the writer. In production code, this should always be
+   *                              {@link GenericIndexedWriter#MAX_FILE_SIZE}. The parameter is exposed only for testing.
+   */
public static V3CompressedVSizeColumnarMultiIntsSerializer create(
final String columnName,
final SegmentWriteOutMedium segmentWriteOutMedium,
final String filenameBase,
final int maxValue,
- final CompressionStrategy compression
+ final CompressionStrategy compression,
+ final int fileSizeLimit
)
{
return new V3CompressedVSizeColumnarMultiIntsSerializer(
columnName,
new CompressedColumnarIntsSerializer(
columnName,
segmentWriteOutMedium,
- filenameBase,
+ filenameBase + ".offsets",
Review Comment:
This was fixing another bug, mentioned in the PR description:

> This patch also includes a fix for an issue on the writer for compressed
> multi-value string columns, `V3CompressedVSizeColumnarMultiIntsSerializer`,
> where it would use the same base filename for both the offset and values
> sections. This bug would only be triggered for segments in excess of 500
> million rows. When a segment has fewer rows than that, it could potentially
> have a values section that needs to be split over multiple files, but the
> offset is never more than 4 bytes per row. This bug was triggered by the new
> tests, which use a smaller fileSizeLimit.
Backward compatibility would be OK, because the secondary filenames are
written into the primary file. They aren't hard-coded on the reader side, so
older readers will still be able to read these differently-named files.
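A back-of-the-envelope check of that 500-million-row threshold (a hypothetical sketch, not Druid code; the class, method names, and the assumption that the single-file limit is on the order of `Integer.MAX_VALUE` bytes are mine, chosen to match the 4-bytes-per-row reasoning quoted above):

```java
// Hypothetical illustration of why the shared-filename bug only surfaces
// beyond roughly 500 million rows. ASSUMPTION: the per-file size limit is
// about 2 GiB (Integer.MAX_VALUE bytes); the real limit is
// GenericIndexedWriter#MAX_FILE_SIZE.
public class OffsetsFileMath
{
  static final int BYTES_PER_OFFSET = 4;                  // one int offset per row
  static final long FILE_SIZE_LIMIT = Integer.MAX_VALUE;  // assumed ~2 GiB limit

  /** Rows needed before the offsets section must split into a second file. */
  static long rowsBeforeOffsetsSplit()
  {
    return FILE_SIZE_LIMIT / BYTES_PER_OFFSET;
  }

  public static void main(String[] args)
  {
    // A bit over 536 million rows: below that, the offsets section fits in a
    // single file, so its filename can never collide with a secondary values
    // file. The values section can split much earlier, since each row of a
    // multi-value column may hold many values.
    System.out.println(rowsBeforeOffsetsSplit());
  }
}
```

This is why the tests, which deliberately use a small `fileSizeLimit`, were able to trigger the collision at far lower row counts.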
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]