Hi Murali,

Thank you for the thorough review -- these are excellent questions
that have helped sharpen the design.

mb-1 (Read path format detection)
Good catch -- the KIP was unclear on this. The design uses a two-layer approach:
Primary (write path): MetadataVersion controls which format is used
for new index files. Before IBP_4_4_IV1 finalization, all new indexes
are written in the legacy 8-byte format. After finalization, new
indexes use the 12-byte format.
Fallback (read path): When opening an existing index file, the format
is auto-detected from the file content -- no filename suffix or marker
needed. The detection analyzes the file size (divisible by 8 only? by
12 only?) and, for ambiguous sizes (divisible by both), validates the
first few entries with each format to determine which interpretation
produces valid, monotonically ordered data. If both interpretations
are valid (rare), the MetadataVersion hint is used as the tiebreaker.
This means after finalization, a log directory may contain both 8-byte
and 12-byte index files, and each will be read correctly. When an
old-format index is rebuilt (e.g., during recover()), it will be
written in the new format.
I will update the KIP to document this mechanism in detail.

mb-2 (RemoteIndexCache)
RemoteIndexCache will rely on the same auto-detection mechanism, since
it fetches index files from tiered storage without knowing which
format was active when the segment was uploaded. The auto-detection
reads the file content and determines the correct format, so a 12-byte
index fetched from remote storage will never be misread as legacy
format. I will add this to the KIP.

mb-3 (Index size guidance)
Good point. I will add concrete guidance to the KIP: with 12-byte
entries, log.index.size.max.bytes (default 10 MB) holds approximately
873K entries. For an 8 GB segment with the default
log.index.interval.bytes of 4096 bytes, approximately 2M index entries
would be needed, requiring approximately 24 MB of index space.
Operators using segments larger than 2 GB should increase
log.index.size.max.bytes proportionally.

mb-4 (SegmentPosition rationale)
SegmentPosition.relativePosition is constructed from
LogOffsetMetadata.relativePositionInSegment, which will become long.
Widening SegmentPosition to match avoids lossy truncation at the
boundary. In practice KRaft metadata segments are small (~1 GB), so
this is purely for type consistency, but it prevents subtle bugs if
the types ever diverge. I will add this rationale to the KIP.

mb-5 (BaseRecords.sizeInBytes() and internal callers)
You are absolutely right -- the example you pointed out (int
physicalPosition = log.sizeInBytes() in LogSegment) would be a real
bug for segments larger than 2 GB. This and all similar internal
callers will be migrated to a new sizeInBytesLong() method that
returns the true long size. The existing sizeInBytes() will clamp at
Integer.MAX_VALUE.
The reason clamping is safe for the remaining ~449 callers is that
they are in the client/network layer, where individual fetch responses
are bounded by max.partition.fetch.bytes (an int config, default 1
MB). A single fetch response will never need to represent more than ~2
GB of data, so the clamped int return value is sufficient for those
call sites.
I will add a complete table of all migrated internal call sites to the
KIP so nothing is missed during implementation review.

mb-6 (Internal type changes)
Yes, exactly right -- all of these position and size variables in
LogSegment (and cascading callers in LocalLog, Cleaner, LogLoader,
UnifiedLog, etc.) will be widened from int to long. Thank you for
calling these out specifically. I will capture every such change in a
table in the KIP's Proposed Changes section, so reviewers can verify
completeness during implementation.

Thanks again for the detailed review, Murali. I will update the KIP
with all of these clarifications.

Best, Haifeng

On 2026/05/09 16:36:42 Muralidhar Basani via dev wrote:
> Hi Chen, thanks for this well written kip. It definitely addresses a core
> issue.
>
> I have a few points to clarify.
>
> mb-1 : It is mentioned as 'The large format is only written after
> MetadataVersion IBP_4_4_IV1 is finalized' and 'New index files are written
> in 12-byte format. Existing 8-byte index files continue to be read
> correctly.'.
>
> But it is not clear how a 8 byte and 12 byte index files are detected by
> OffsetIndex, after the 4.4 is finalized.
> Because log dir may contain 8 byte and 12 byte index files simultaneously.
> Something like a filename suffix ?
> There is a mention of 5 arg constructor, but it controls the writing path,
> but no mechanism mentioned for the read path.
>
> mb-2 : RemoteIndexCache caches index files fetched from tiered storage. Is
> it possible that 12-byte index gets cached, but read back as legacy format ?
>
> mb-3 : Current default for log.index.size.max.bytes is 10mb and
> log.index.interval.bytes default is 4 kb. But to handle larger segments, I
> think this is not enough. Would it be better to mention some guidance or
> other default in the kip? For ex : guidance on value, for a 8 gb segment,
> needs this index cap to be around 24mb ?
>
> mb-4 : There is a change of type for SegmentPosition.relativePosition
> (raft) from int to long. But it is not mentioned why this needs a change?
>
> mb-5 : As per kip, BaseRecords.sizeInBytes() is not changed and keeps it as
> int. But when FileRecords size is > 2 gb, curious what should be returned
> by sizeInBytes ?
>
> And 'BaseRecords.sizeInBytes() remains int (449 callers.. ' is there really
> no impact ?
>
> For ex : LogSegment.java:256, we have int physicalPosition =
> log.sizeInBytes(). If this value is above 2gb, it's value can become wrong,
> pointing to wrong position ? May be there are other references like these.
>
> mb-6 : Also in LogSegment line 446, int startPosition =
> startOffsetAndSize.position; If this value is above 2gb, probably this gets
> changed to long ?
> and in line 460,
> int fetchSize = Math.min((int) (maxPositionOpt.get() - startPosition),
> adjustedMaxSize);
>
> these kinds of changes could be captured while implementing may be, Just
> thought, if they could be mentioned in kip, so we don't miss while
> implementing ?
>
>
> Thanks,
> Murali
>
> On Wed, May 6, 2026 at 12:35 PM Haifeng Chen <[email protected]>
> wrote:
>
> > Hi all,
> >
> > I'd like to propose KIP-1333: Support log segments larger than 2 GB.
> >
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1333%3A+Support+log+segments+larger+than+2+GB
> >
> > Please help take a look and look forward to your feedback. Thanks,
> >
> > Haifeng
> >
>

Reply via email to