xiangfu0 opened a new pull request, #18772:
URL: https://github.com/apache/pinot/pull/18772
## Summary
`VarByteChunkForwardIndexWriterV6` currently closes a chunk only when the next entry would overflow the `chunkSize`-byte buffer. This PR adds an optional `targetDocsPerChunk` parameter so a chunk can additionally be bounded by document count, letting callers control chunk granularity independently of the byte budget.
- New 4-arg constructor `VarByteChunkForwardIndexWriterV6(file, compressionType, chunkSize, targetDocsPerChunk)`. The existing 3-arg constructor delegates with `targetDocsPerChunk = -1` (`DISABLE_DOCS_PER_CHUNK`).
- When `targetDocsPerChunk > 0`, a chunk is flushed once it holds that many docs, even if the byte buffer isn't full; otherwise behavior is unchanged.
- The buffer-overflow flush predicate is extracted into a protected `shouldStartNewChunk(int)` hook in `VarByteChunkForwardIndexWriterV4` (mirroring the existing `writeChunkHeader` hook), so V6 adds the cap without duplicating `putBytes()`.
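The hook pattern described above can be sketched in isolation. This is a simplified model, not the Pinot code: `ByteBoundedChunker`, `DocCappedChunker`, `ChunkerDemo`, and the fields they use are hypothetical names, the model only counts bytes and docs instead of writing a file, and it assumes the cap is checked lazily, just before the next entry is appended. Only `shouldStartNewChunk(int)`, `targetDocsPerChunk`, `chunkSize`, and `DISABLE_DOCS_PER_CHUNK` are taken from the PR description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the V4 writer: closes a chunk only on byte overflow.
class ByteBoundedChunker {
  protected final int chunkSize;
  protected int bytesInChunk = 0;
  protected int docsInChunk = 0;
  protected final List<Integer> docsPerClosedChunk = new ArrayList<>();

  ByteBoundedChunker(int chunkSize) {
    this.chunkSize = chunkSize;
  }

  // Hook: must the current chunk be closed before appending an entry of this size?
  protected boolean shouldStartNewChunk(int entrySize) {
    return docsInChunk > 0 && bytesInChunk + entrySize > chunkSize;
  }

  void put(byte[] value) {
    if (shouldStartNewChunk(value.length)) {
      flush();
    }
    bytesInChunk += value.length;
    docsInChunk++;
  }

  // Closes the current chunk (if non-empty) and records how many docs it held.
  void flush() {
    if (docsInChunk == 0) {
      return;
    }
    docsPerClosedChunk.add(docsInChunk);
    bytesInChunk = 0;
    docsInChunk = 0;
  }
}

// Hypothetical stand-in for the V6 writer: adds the doc-count cap through the hook only.
class DocCappedChunker extends ByteBoundedChunker {
  static final int DISABLE_DOCS_PER_CHUNK = -1;
  private final int targetDocsPerChunk;

  // Mirrors the delegation described in the PR: the shorter constructor disables the cap.
  DocCappedChunker(int chunkSize) {
    this(chunkSize, DISABLE_DOCS_PER_CHUNK);
  }

  DocCappedChunker(int chunkSize, int targetDocsPerChunk) {
    super(chunkSize);
    this.targetDocsPerChunk = targetDocsPerChunk;
  }

  @Override
  protected boolean shouldStartNewChunk(int entrySize) {
    boolean docCapReached = targetDocsPerChunk > 0 && docsInChunk >= targetDocsPerChunk;
    return docCapReached || super.shouldStartNewChunk(entrySize);
  }
}

class ChunkerDemo {
  // Feeds numDocs values of valueSize bytes and returns the doc count of each closed chunk.
  static List<Integer> run(ByteBoundedChunker chunker, int numDocs, int valueSize) {
    for (int i = 0; i < numDocs; i++) {
      chunker.put(new byte[valueSize]);
    }
    chunker.flush();
    return chunker.docsPerClosedChunk;
  }

  public static void main(String[] args) {
    System.out.println(run(new DocCappedChunker(1024, 3), 7, 10)); // [3, 3, 1]
    System.out.println(run(new DocCappedChunker(1024), 7, 10));    // [7]
    System.out.println(run(new DocCappedChunker(25, 3), 7, 10));   // [2, 2, 2, 1]
  }
}
```

In this model the subclass never touches `put()`, which is the point of extracting the predicate: the byte-overflow rule still applies when it triggers first (third case), and the disabled cap falls through to the parent's behavior (second case).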
## Motivation
For raw string/bytes columns, the ZSTD compression ratio depends heavily on how many repeated values fall within a single chunk (the dedup window). Being able to bound a chunk by document count, not only by bytes, gives finer control over the size/granularity tradeoff for repetitive columns.
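The effect of chunk granularity on compressed size can be illustrated with a small experiment. This is a rough sketch under stated assumptions: it uses the JDK's `java.util.zip.Deflater` as a stand-in because ZSTD is not in the standard library, the column contents are synthetic, and `ChunkCompressionDemo` and its methods are hypothetical names unrelated to the Pinot writer. Absolute numbers will differ from ZSTD; only the direction of the effect is the point.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class ChunkCompressionDemo {
  // Compresses one chunk with DEFLATE (a stand-in for ZSTD) and returns its compressed size.
  static int compressedSize(byte[] input) {
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    byte[] buffer = new byte[1024];
    int total = 0;
    while (!deflater.finished()) {
      total += deflater.deflate(buffer);
    }
    deflater.end();
    return total;
  }

  // Splits the values into chunks of docsPerChunk docs, compresses each chunk
  // independently, and returns the sum of the compressed sizes.
  static int totalCompressedSize(String[] values, int docsPerChunk) {
    ByteArrayOutputStream chunk = new ByteArrayOutputStream();
    int docsInChunk = 0;
    int total = 0;
    for (String value : values) {
      byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
      chunk.write(bytes, 0, bytes.length);
      docsInChunk++;
      if (docsInChunk == docsPerChunk) {
        total += compressedSize(chunk.toByteArray());
        chunk.reset();
        docsInChunk = 0;
      }
    }
    if (docsInChunk > 0) {
      total += compressedSize(chunk.toByteArray());
    }
    return total;
  }

  // Synthetic repetitive column: only five distinct values.
  static String[] repetitiveColumn(int numDocs) {
    String[] values = new String[numDocs];
    for (int i = 0; i < numDocs; i++) {
      values[i] = "status-" + (i % 5);
    }
    return values;
  }

  public static void main(String[] args) {
    String[] values = repetitiveColumn(1000);
    System.out.println("10 docs/chunk:   " + totalCompressedSize(values, 10) + " bytes");
    System.out.println("1000 docs/chunk: " + totalCompressedSize(values, 1000) + " bytes");
  }
}
```

Because each chunk is compressed on its own, repeats that land in different chunks cannot reference each other, so many small chunks compress worse in total than one large chunk over the same data. A reader, on the other hand, presumably has to decompress a whole chunk to fetch one value, which is the other side of the tradeoff a doc-count bound lets callers tune.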
## Backward compatibility
- The default `-1` reproduces the exact current behavior.
- The on-disk format and writer version (`6`) are unchanged; the target chunk size remains self-describing in the file header, so existing and new indexes stay mutually readable.
## Testing
- New `VarByteChunkV6Test#testTargetDocsPerChunkCapsChunk` asserts each capped chunk holds exactly `targetDocsPerChunk` docs and that values round-trip.
- All inherited V4/V5/V6 read/write tests pass (the `-1` default path).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]