Copilot commented on code in PR #18772:
URL: https://github.com/apache/pinot/pull/18772#discussion_r3426242761
##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV6.java:
##########
@@ -24,41 +24,69 @@
import org.apache.pinot.segment.spi.compression.ChunkCompressionType;
-/**
- * Forward index writer that extends {@link VarByteChunkForwardIndexWriterV5}
and delta-encodes the chunk header
- * when compression is enabled, writing individual entry sizes instead of
cumulative byte offsets.
- *
- * - **V4 chunk header**: `[numDocs][offset0][offset1]...[offsetN-1]` —
cumulative byte offsets.
- * - **V6 chunk header** (compressed): `[numDocs][size0][size1]...[sizeN-1]` —
individual entry sizes.
- * Sizes compress dramatically better (e.g. 11x with ZSTD) because they are
small, repetitive values.
- * The reader converts sizes back to offsets at read time with a single
forward pass.
- * - **PASS_THROUGH**: delta encoding provides no benefit, so V6 falls back to
V4's offset-based header.
- *
- * @see VarByteChunkForwardIndexWriterV4
- * @see VarByteChunkForwardIndexWriterV5
- */
+/// Forward index writer that extends [VarByteChunkForwardIndexWriterV5] and
delta-encodes the chunk header
+/// when compression is enabled, writing individual entry sizes instead of
cumulative byte offsets.
+///
+/// - **V4 chunk header**: `[numDocs][offset0][offset1]...[offsetN-1]` —
cumulative byte offsets.
+/// - **V6 chunk header** (compressed): `[numDocs][size0][size1]...[sizeN-1]`
— individual entry sizes.
+/// Sizes compress dramatically better (e.g. 11x with ZSTD) because they are
small, repetitive values.
+/// The reader converts sizes back to offsets at read time with a single
forward pass.
+/// - **PASS_THROUGH**: delta encoding provides no benefit, so V6 falls back
to V4's offset-based header.
+///
+/// ## Chunk boundaries
+/// A chunk is flushed whenever the next entry would overflow the
`chunkSize`-byte chunk buffer
+/// (inherited V4 behavior). V6 additionally accepts a `targetDocsPerChunk`
cap: when positive, a
+/// chunk is also flushed once it reaches that many documents, even if the
byte buffer is not yet full.
+/// This lets callers bound a chunk by both byte size and document count. The
cap defaults to `-1`
+/// (disabled), which preserves the original purely byte-driven behavior.
+///
+/// @see VarByteChunkForwardIndexWriterV4
+/// @see VarByteChunkForwardIndexWriterV5
@NotThreadSafe
public class VarByteChunkForwardIndexWriterV6 extends
VarByteChunkForwardIndexWriterV5 {
public static final int VERSION = 6;
+ /// Sentinel meaning "no docs-per-chunk cap": chunks are bounded only by
`chunkSize` bytes.
+ public static final int DISABLE_DOCS_PER_CHUNK = -1;
+
private final boolean _deltaEncoding;
+ private final int _targetDocsPerChunk;
public VarByteChunkForwardIndexWriterV6(File file, ChunkCompressionType
compressionType, int chunkSize)
throws IOException {
+ this(file, compressionType, chunkSize, DISABLE_DOCS_PER_CHUNK);
+ }
+
+ /// @param file output index file
+ /// @param compressionType chunk compression codec
+ /// @param chunkSize target uncompressed chunk size in bytes (the chunk
buffer capacity)
+ /// @param targetDocsPerChunk flush a chunk once it holds this many
documents, in addition to the
+ /// byte-size limit; `-1` ([#DISABLE_DOCS_PER_CHUNK]) disables the cap
and keeps the purely
+ /// byte-driven behavior
+ public VarByteChunkForwardIndexWriterV6(File file, ChunkCompressionType
compressionType, int chunkSize,
+ int targetDocsPerChunk)
+ throws IOException {
super(file, compressionType, chunkSize);
_deltaEncoding = compressionType != ChunkCompressionType.PASS_THROUGH;
+ _targetDocsPerChunk = targetDocsPerChunk;
Review Comment:
`targetDocsPerChunk` is documented as either positive or `-1` (disabled),
but the constructor currently accepts `0` and other negative values silently
(they behave like "disabled" because the check is `_targetDocsPerChunk > 0`).
This makes accidental misconfiguration easy to miss; please validate the
parameter and fail fast for values other than `-1` or `> 0`.
##########
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/VarByteChunkV6Test.java:
##########
@@ -101,4 +101,46 @@ public void validateCompressionRatioIncrease()
FileUtils.deleteQuietly(v4FwdIndexFile);
FileUtils.deleteQuietly(v6FwdIndexFile);
}
+
+ /// A positive `targetDocsPerChunk` must cap each chunk at that many
documents even when the byte
+ /// buffer is far from full, and values must still round-trip. The default
(`-1`) keeps the original
+ /// byte-driven behavior, which is exercised by all the inherited read/write
tests.
+ @Test
+ public void testTargetDocsPerChunkCapsChunk()
+ throws IOException {
+ int numDocs = 1000;
+ int docsPerChunk = 50;
+ // Large byte budget so flushing is driven purely by the docs-per-chunk
cap, not the buffer size.
+ int chunkSize = 1 << 20;
+ File file = new File(FileUtils.getTempDirectory(), "v6test_docs_cap");
+ FileUtils.deleteQuietly(file);
+
+ String[] values = new String[numDocs];
+ try (VarByteChunkForwardIndexWriterV6 writer =
+ new VarByteChunkForwardIndexWriterV6(file,
ChunkCompressionType.ZSTANDARD, chunkSize, docsPerChunk)) {
+ for (int i = 0; i < numDocs; i++) {
+ values[i] = "value_" + i;
+ writer.putString(values[i]);
+ }
+ }
+
+ try (PinotDataBuffer buffer =
PinotDataBuffer.mapReadOnlyBigEndianFile(file)) {
+ // Chunk metadata starts at byte 16; each entry is 8 bytes. The metadata
region ends at the offset
+ // stored at byte 12 (start of chunk data).
+ int numChunks = (buffer.getInt(12) - 16) / 8;
+ Assert.assertEquals(numChunks, (numDocs + docsPerChunk - 1) /
docsPerChunk,
+ "Each chunk should hold exactly targetDocsPerChunk documents");
+ }
Review Comment:
This test claims to assert that each chunk is capped to `docsPerChunk`, but
it only asserts the *number* of chunks derived from the metadata region size.
That doesn't actually verify per-chunk doc counts. Consider validating chunk
boundaries using the per-chunk metadata docId offsets (difference between
successive chunk start docIds should be `docsPerChunk`, and the last chunk
should contain the remainder).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]