Re: Re: [DISCUSS] KIP-1333: Support log segments larger than 2 GB

Haifeng Chen Mon, 11 May 2026 21:21:02 -0700

I have an initial PR https://github.com/acgtun/kafka/pull/1


On Mon, May 11, 2026 at 10:01 AM Haifeng Chen <[email protected]> wrote:
>
> Hi Murali,
>
> Thank you for the thorough review -- these are excellent questions
> that have helped sharpen the design.
>
> mb-1 (Read path format detection)
> Good catch -- the KIP was unclear on this. The design uses a two-layer 
> approach:
> Primary (write path): MetadataVersion controls which format is used
> for new index files. Before IBP_4_4_IV1 finalization, all new indexes
> are written in the legacy 8-byte format. After finalization, new
> indexes use the 12-byte format.
> Fallback (read path): When opening an existing index file, the format
> is auto-detected from the file content -- no filename suffix or marker
> needed. The detection analyzes the file size (divisible by 8 only? by
> 12 only?) and, for ambiguous sizes (divisible by both), validates the
> first few entries with each format to determine which interpretation
> produces valid, monotonically ordered data. If both interpretations
> are valid (rare), the MetadataVersion hint is used as the tiebreaker.
> This means after finalization, a log directory may contain both 8-byte
> and 12-byte index files, and each will be read correctly. When an
> old-format index is rebuilt (e.g., during recover()), it will be
> written in the new format.
> I will update the KIP to document this mechanism in detail.
>
> mb-2 (RemoteIndexCache)
> RemoteIndexCache will rely on the same auto-detection mechanism, since
> it fetches index files from tiered storage without knowing which
> format was active when the segment was uploaded. The auto-detection
> reads the file content and determines the correct format, so a 12-byte
> index fetched from remote storage will never be misread as legacy
> format. I will add this to the KIP.
>
> mb-3 (Index size guidance)
> Good point. I will add concrete guidance to the KIP: with 12-byte
> entries, log.index.size.max.bytes (default 10 MB) holds approximately
> 873K entries. For an 8 GB segment with the default
> log.index.interval.bytes of 4096 bytes, approximately 2M index entries
> would be needed, requiring approximately 24 MB of index space.
> Operators using segments larger than 2 GB should increase
> log.index.size.max.bytes proportionally.
>
> mb-4 (SegmentPosition rationale)
> SegmentPosition.relativePosition is constructed from
> LogOffsetMetadata.relativePositionInSegment, which will become long.
> Widening SegmentPosition to match avoids lossy truncation at the
> boundary. In practice KRaft metadata segments are small (~1 GB), so
> this is purely for type consistency, but it prevents subtle bugs if
> the types ever diverge. I will add this rationale to the KIP.
>
> mb-5 (BaseRecords.sizeInBytes() and internal callers)
> You are absolutely right -- the example you pointed out (int
> physicalPosition = log.sizeInBytes() in LogSegment) would be a real
> bug for segments larger than 2 GB. This and all similar internal
> callers will be migrated to a new sizeInBytesLong() method that
> returns the true long size. The existing sizeInBytes() will clamp at
> Integer.MAX_VALUE.
> The reason clamping is safe for the remaining ~449 callers is that
> they are in the client/network layer, where individual fetch responses
> are bounded by max.partition.fetch.bytes (an int config, default 1
> MB). A single fetch response will never need to represent more than ~2
> GB of data, so the clamped int return value is sufficient for those
> call sites.
> I will add a complete table of all migrated internal call sites to the
> KIP so nothing is missed during implementation review.
>
> mb-6 (Internal type changes)
> Yes, exactly right -- all of these position and size variables in
> LogSegment (and cascading callers in LocalLog, Cleaner, LogLoader,
> UnifiedLog, etc.) will be widened from int to long. Thank you for
> calling these out specifically. I will capture every such change in a
> table in the KIP's Proposed Changes section, so reviewers can verify
> completeness during implementation.
>
> Thanks again for the detailed review, Murali. I will update the KIP
> with all of these clarifications.
>
> Best, Haifeng
>
> On 2026/05/09 16:36:42 Muralidhar Basani via dev wrote:
> > Hi Chen, thanks for this well written kip. It definitely addresses a core
> > issue.
> >
> > I have a few points to clarify.
> >
> > mb-1 : It is mentioned as 'The large format is only written after
> > MetadataVersion IBP_4_4_IV1 is finalized' and 'New index files are written
> > in 12-byte format. Existing 8-byte index files continue to be read
> > correctly.'.
> >
> > But it is not clear how a 8 byte and 12 byte index files are detected by
> > OffsetIndex, after the 4.4 is finalized.
> > Because log dir may contain 8 byte and 12 byte index files simultaneously.
> > Something like a filename suffix ?
> > There is a mention of 5 arg constructor, but it controls the writing path,
> > but no mechanism mentioned for the read path.
> >
> > mb-2 : RemoteIndexCache caches index files fetched from tiered storage. Is
> > it possible that 12-byte index gets cached, but read back as legacy format ?
> >
> > mb-3 : Current default for log.index.size.max.bytes is 10mb and
> > log.index.interval.bytes default is 4 kb. But to handle larger segments, I
> > think this is not enough. Would it be better to mention some guidance or
> > other default in the kip? For ex : guidance on value, for a 8 gb segment,
> > needs this index cap to be around 24mb ?
> >
> > mb-4 : There is a change of type for SegmentPosition.relativePosition
> > (raft) from int to long. But it is not mentioned why this needs a change?
> >
> > mb-5 : As per kip, BaseRecords.sizeInBytes() is not changed and keeps it as
> > int. But when FileRecords size is > 2 gb, curious what should be returned
> > by sizeInBytes ?
> >
> > And 'BaseRecords.sizeInBytes() remains int (449 callers.. ' is there really
> > no impact ?
> >
> > For ex : LogSegment.java:256, we have int physicalPosition =
> > log.sizeInBytes(). If this value is above 2gb, it's value can become wrong,
> > pointing to wrong position ? May be there are other references like these.
> >
> > mb-6 : Also in LogSegment line 446, int startPosition =
> > startOffsetAndSize.position; If this value is above 2gb, probably this gets
> > changed to long ?
> > and in line 460,
> > int fetchSize = Math.min((int) (maxPositionOpt.get() - startPosition),
> > adjustedMaxSize);
> >
> > these kinds of changes could be captured while implementing may be, Just
> > thought, if they could be mentioned in kip, so we don't miss while
> > implementing ?
> >
> >
> > Thanks,
> > Murali
> >
> > On Wed, May 6, 2026 at 12:35 PM Haifeng Chen <[email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > I'd like to propose KIP-1333: Support log segments larger than 2 GB.
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1333%3A+Support+log+segments+larger+than+2+GB
> > >
> > > Please help take a look and look forward to your feedback. Thanks,
> > >
> > > Haifeng
> > >
> >

Re: Re: [DISCUSS] KIP-1333: Support log segments larger than 2 GB

Reply via email to