gianm commented on code in PR #17691:
URL: https://github.com/apache/druid/pull/17691#discussion_r1939794731
##########
processing/src/main/java/org/apache/druid/segment/data/V3CompressedVSizeColumnarMultiIntsSerializer.java:
##########
@@ -35,33 +35,47 @@ public class V3CompressedVSizeColumnarMultiIntsSerializer
     extends ColumnarMultiIntsSerializer
 {
private static final byte VERSION =
V3CompressedVSizeColumnarMultiIntsSupplier.VERSION;
+  /**
+   * Creates a new serializer.
+   *
+   * @param columnName            name of the column to write
+   * @param segmentWriteOutMedium supplier of temporary files
+   * @param filenameBase          base filename to be used for secondary files, if multiple files are needed
+   * @param maxValue              maximum integer value that will be written to the column
+   * @param compression           compression strategy to apply
+   * @param fileSizeLimit         limit for files created by the writer. In production code, this should always be
+   *                              {@link GenericIndexedWriter#MAX_FILE_SIZE}. The parameter is exposed only for testing.
+   */
public static V3CompressedVSizeColumnarMultiIntsSerializer create(
final String columnName,
final SegmentWriteOutMedium segmentWriteOutMedium,
final String filenameBase,
final int maxValue,
- final CompressionStrategy compression
+ final CompressionStrategy compression,
+ final int fileSizeLimit
)
{
return new V3CompressedVSizeColumnarMultiIntsSerializer(
columnName,
new CompressedColumnarIntsSerializer(
columnName,
segmentWriteOutMedium,
- filenameBase,
+ filenameBase + ".offsets",
Review Comment:
This was fixing another bug, mentioned in the PR description:

> This patch also includes a fix for an issue on the writer for compressed
> multi-value string columns, `V3CompressedVSizeColumnarMultiIntsSerializer`,
> where it would use the same base filename for both the offset and values
> sections. This bug would only be triggered for segments in excess of 500
> million rows. When a segment has fewer rows than that, it could potentially
> have a values section that needs to be split over multiple files, but the
> offset is never more than 4 bytes per row. This bug was triggered by the new
> tests, which use a smaller fileSizeLimit.
Backward compatibility would be OK, because the secondary filenames are
written into the primary file. They aren't hard-coded on the reader side, so
older readers will still be able to read these differently-named files.
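A back-of-the-envelope check of that 500-million-row threshold (a hypothetical sketch, not Druid code; the class, method names, and the assumption that the single-file limit is on the order of `Integer.MAX_VALUE` bytes are mine, chosen to match the 4-bytes-per-row reasoning quoted above):

```java
// Hypothetical illustration of why the shared-filename bug only surfaces
// beyond roughly 500 million rows. ASSUMPTION: the per-file size limit is
// about 2 GiB (Integer.MAX_VALUE bytes); the real limit is
// GenericIndexedWriter#MAX_FILE_SIZE.
public class OffsetsFileMath
{
  static final int BYTES_PER_OFFSET = 4;                  // one int offset per row
  static final long FILE_SIZE_LIMIT = Integer.MAX_VALUE;  // assumed ~2 GiB limit

  /** Rows needed before the offsets section must split into a second file. */
  static long rowsBeforeOffsetsSplit()
  {
    return FILE_SIZE_LIMIT / BYTES_PER_OFFSET;
  }

  public static void main(String[] args)
  {
    // A bit over 536 million rows: below that, the offsets section fits in a
    // single file, so its filename can never collide with a secondary values
    // file. The values section can split much earlier, since each row of a
    // multi-value column may hold many values.
    System.out.println(rowsBeforeOffsetsSplit());
  }
}
```

This is why the tests, which deliberately use a small `fileSizeLimit`, were able to trigger the collision at far lower row counts.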
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]