I have an initial PR https://github.com/acgtun/kafka/pull/1
On Mon, May 11, 2026 at 10:01 AM Haifeng Chen <[email protected]> wrote: > > Hi Murali, > > Thank you for the thorough review -- these are excellent questions > that have helped sharpen the design. > > mb-1 (Read path format detection) > Good catch -- the KIP was unclear on this. The design uses a two-layer > approach: > Primary (write path): MetadataVersion controls which format is used > for new index files. Before IBP_4_4_IV1 finalization, all new indexes > are written in the legacy 8-byte format. After finalization, new > indexes use the 12-byte format. > Fallback (read path): When opening an existing index file, the format > is auto-detected from the file content -- no filename suffix or marker > needed. The detection analyzes the file size (divisible by 8 only? by > 12 only?) and, for ambiguous sizes (divisible by both), validates the > first few entries with each format to determine which interpretation > produces valid, monotonically ordered data. If both interpretations > are valid (rare), the MetadataVersion hint is used as the tiebreaker. > This means after finalization, a log directory may contain both 8-byte > and 12-byte index files, and each will be read correctly. When an > old-format index is rebuilt (e.g., during recover()), it will be > written in the new format. > I will update the KIP to document this mechanism in detail. > > mb-2 (RemoteIndexCache) > RemoteIndexCache will rely on the same auto-detection mechanism, since > it fetches index files from tiered storage without knowing which > format was active when the segment was uploaded. The auto-detection > reads the file content and determines the correct format, so a 12-byte > index fetched from remote storage will never be misread as legacy > format. I will add this to the KIP. > > mb-3 (Index size guidance) > Good point. I will add concrete guidance to the KIP: with 12-byte > entries, log.index.size.max.bytes (default 10 MB) holds approximately > 873K entries. For an 8 GB segment with the default > log.index.interval.bytes of 4096 bytes, approximately 2M index entries > would be needed, requiring approximately 24 MB of index space. > Operators using segments larger than 2 GB should increase > log.index.size.max.bytes proportionally. > > mb-4 (SegmentPosition rationale) > SegmentPosition.relativePosition is constructed from > LogOffsetMetadata.relativePositionInSegment, which will become long. > Widening SegmentPosition to match avoids lossy truncation at the > boundary. In practice KRaft metadata segments are small (~1 GB), so > this is purely for type consistency, but it prevents subtle bugs if > the types ever diverge. I will add this rationale to the KIP. > > mb-5 (BaseRecords.sizeInBytes() and internal callers) > You are absolutely right -- the example you pointed out (int > physicalPosition = log.sizeInBytes() in LogSegment) would be a real > bug for segments larger than 2 GB. This and all similar internal > callers will be migrated to a new sizeInBytesLong() method that > returns the true long size. The existing sizeInBytes() will clamp at > Integer.MAX_VALUE. > The reason clamping is safe for the remaining ~449 callers is that > they are in the client/network layer, where individual fetch responses > are bounded by max.partition.fetch.bytes (an int config, default 1 > MB). A single fetch response will never need to represent more than ~2 > GB of data, so the clamped int return value is sufficient for those > call sites. > I will add a complete table of all migrated internal call sites to the > KIP so nothing is missed during implementation review. > > mb-6 (Internal type changes) > Yes, exactly right -- all of these position and size variables in > LogSegment (and cascading callers in LocalLog, Cleaner, LogLoader, > UnifiedLog, etc.) will be widened from int to long. Thank you for > calling these out specifically. I will capture every such change in a > table in the KIP's Proposed Changes section, so reviewers can verify > completeness during implementation. > > Thanks again for the detailed review, Murali. I will update the KIP > with all of these clarifications. > > Best, Haifeng > > On 2026/05/09 16:36:42 Muralidhar Basani via dev wrote: > > Hi Chen, thanks for this well written kip. It definitely addresses a core > > issue. > > > > I have a few points to clarify. > > > > mb-1 : It is mentioned as 'The large format is only written after > > MetadataVersion IBP_4_4_IV1 is finalized' and 'New index files are written > > in 12-byte format. Existing 8-byte index files continue to be read > > correctly.'. > > > > But it is not clear how a 8 byte and 12 byte index files are detected by > > OffsetIndex, after the 4.4 is finalized. > > Because log dir may contain 8 byte and 12 byte index files simultaneously. > > Something like a filename suffix ? > > There is a mention of 5 arg constructor, but it controls the writing path, > > but no mechanism mentioned for the read path. > > > > mb-2 : RemoteIndexCache caches index files fetched from tiered storage. Is > > it possible that 12-byte index gets cached, but read back as legacy format ? > > > > mb-3 : Current default for log.index.size.max.bytes is 10mb and > > log.index.interval.bytes default is 4 kb. But to handle larger segments, I > > think this is not enough. Would it be better to mention some guidance or > > other default in the kip? For ex : guidance on value, for a 8 gb segment, > > needs this index cap to be around 24mb ? > > > > mb-4 : There is a change of type for SegmentPosition.relativePosition > > (raft) from int to long. But it is not mentioned why this needs a change? > > > > mb-5 : As per kip, BaseRecords.sizeInBytes() is not changed and keeps it as > > int. But when FileRecords size is > 2 gb, curious what should be returned > > by sizeInBytes ? > > > > And 'BaseRecords.sizeInBytes() remains int (449 callers.. ' is there really > > no impact ? > > > > For ex : LogSegment.java:256, we have int physicalPosition = > > log.sizeInBytes(). If this value is above 2gb, it's value can become wrong, > > pointing to wrong position ? May be there are other references like these. > > > > mb-6 : Also in LogSegment line 446, int startPosition = > > startOffsetAndSize.position; If this value is above 2gb, probably this gets > > changed to long ? > > and in line 460, > > int fetchSize = Math.min((int) (maxPositionOpt.get() - startPosition), > > adjustedMaxSize); > > > > these kinds of changes could be captured while implementing may be, Just > > thought, if they could be mentioned in kip, so we don't miss while > > implementing ? > > > > > > Thanks, > > Murali > > > > On Wed, May 6, 2026 at 12:35 PM Haifeng Chen <[email protected]> > > wrote: > > > > > Hi all, > > > > > > I'd like to propose KIP-1333: Support log segments larger than 2 GB. > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1333%3A+Support+log+segments+larger+than+2+GB > > > > > > Please help take a look and look forward to your feedback. Thanks, > > > > > > Haifeng > > > > >
