Hi Murali,

Thank you for the thorough review -- these are excellent questions
that have helped sharpen the design.

mb-1 (Read path format detection)
Good catch -- the KIP was unclear on this. The design uses a two-layer approach:
Primary (write path): MetadataVersion controls which format is used
for new index files. Before IBP_4_4_IV1 finalization, all new indexes
are written in the legacy 8-byte format. After finalization, new
indexes use the 12-byte format.
Fallback (read path): When opening an existing index file, the format
is auto-detected from the file content -- no filename suffix or marker
needed. The detection analyzes the file size (divisible by 8 only? by
12 only?) and, for ambiguous sizes (divisible by both), validates the
first few entries with each format to determine which interpretation
produces valid, monotonically ordered data. If both interpretations
are valid (rare), the MetadataVersion hint is used as the tiebreaker.
This means after finalization, a log directory may contain both 8-byte
and 12-byte index files, and each will be read correctly. When an
old-format index is rebuilt (e.g., during recover()), it will be
written in the new format.
I will update the KIP to document this mechanism in detail.

mb-2 (RemoteIndexCache)
RemoteIndexCache will rely on the same auto-detection mechanism, since
it fetches index files from tiered storage without knowing which
format was active when the segment was uploaded. The auto-detection
reads the file content and determines the correct format, so a 12-byte
index fetched from remote storage will never be misread as legacy
format. I will add this to the KIP.

mb-3 (Index size guidance)
Good point. I will add concrete guidance to the KIP: with 12-byte
entries, log.index.size.max.bytes (default 10 MB) holds approximately
873K entries. For an 8 GB segment with the default
log.index.interval.bytes of 4096 bytes, approximately 2M index entries
would be needed, requiring approximately 24 MB of index space.
Operators using segments larger than 2 GB should increase
log.index.size.max.bytes proportionally.

mb-4 (SegmentPosition rationale)
SegmentPosition.relativePosition is constructed from
LogOffsetMetadata.relativePositionInSegment, which will become long.
Widening SegmentPosition to match avoids lossy truncation at the
boundary. In practice KRaft metadata segments are small (~1 GB), so
this is purely for type consistency, but it prevents subtle bugs if
the types ever diverge. I will add this rationale to the KIP.

mb-5 (BaseRecords.sizeInBytes() and internal callers)
You are absolutely right -- the example you pointed out (int
physicalPosition = log.sizeInBytes() in LogSegment) would be a real
bug for segments larger than 2 GB. This and all similar internal
callers will be migrated to a new sizeInBytesLong() method that
returns the true long size. The existing sizeInBytes() will clamp at
Integer.MAX_VALUE.
The reason clamping is safe for the remaining ~449 callers is that
they are in the client/network layer, where individual fetch responses
are bounded by max.partition.fetch.bytes (an int config, default 1
MB). A single fetch response will never need to represent more than ~2
GB of data, so the clamped int return value is sufficient for those
call sites.
I will add a complete table of all migrated internal call sites to the
KIP so nothing is missed during implementation review.

mb-6 (Internal type changes)
Yes, exactly right -- all of these position and size variables in
LogSegment (and cascading callers in LocalLog, Cleaner, LogLoader,
UnifiedLog, etc.) will be widened from int to long. Thank you for
calling these out specifically. I will capture every such change in a
table in the KIP's Proposed Changes section, so reviewers can verify
completeness during implementation.

Thanks again for the detailed review, Murali. I will update the KIP
with all of these clarifications.

Best, Haifeng

Reply via email to