jbbqqf opened a new pull request, #16265:
URL: https://github.com/apache/iceberg/pull/16265
## Summary
`BaseFile.splitOffsets()` previously wrapped the underlying `long[]` in a
fresh `ArrayUtil.toUnmodifiableLongList` on every invocation. Manifest
rewriting, format conversion, and `toString()` debugging can call
`splitOffsets()` many times for the same file, leaking a `List<Long>` wrapper
each call.
This PR caches the unmodifiable view in a transient field, populated lazily
on the first call and invalidated wherever the underlying `long[]` is
reassigned.
Resolves #15622
## Context
The original ask in #15622 — *"BaseFile.splitOffsets() allocates a new List
on every call"* — was previously addressed in #15625 / #15624, both of which
were stale-closed at 30 days with no review (no technical objections). This PR
is a fresh, minimal revival of that approach with explicit invalidation at
every write site, regression tests covering the cache contract, and AI
disclosure.
The trade-off is small: one extra reference field per `BaseFile` (a few
bytes), zero serialization impact (transient), and a one-line guard in
`splitOffsets()` itself. Hot paths that call `splitOffsets()` repeatedly stop
paying the per-call allocation cost.
## Changes
- `core/src/main/java/org/apache/iceberg/BaseFile.java`
- Add a transient `List<Long> splitOffsetsList` cache field.
- `splitOffsets()` populates the cache on first call and reuses it
thereafter; corrupted offsets still return `null` stably.
- All three sites that reassign the `long[] splitOffsets` field (the
long-form constructor, the copy constructor, and the indexed setter `case 14`)
now also null out the cache so the next `splitOffsets()` call rebuilds it from
the new array.
- Inline comment on the cache field explains the why and points to the
issue, so a reviewer reading the diff cold doesn't have to re-derive it.
- `core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java`
*(new)* — four tests covering: repeated-call instance reuse;
null-when-corrupted contract is stable across calls; copy-constructor produces
an independent cached view; the returned list remains unmodifiable.
- `core/src/test/java/org/apache/iceberg/TestManifestReader.java` — extends
`FILE_COMPARISON_CONFIG` to ignore the new transient cache field. Recursive
equality on data files should not depend on whether `splitOffsets()` /
`toString()` happened to be called on one side of the comparison before the
assertion. (This is the same pattern the config already uses for
`manifestLocation`, `fileOrdinal`, `fromProjectionPos`, etc.)
## Reproduce BEFORE/AFTER yourself (copy-paste)
```bash
# --- one-time setup ---
git clone https://github.com/apache/iceberg.git /tmp/repro && cd /tmp/repro
# --- BEFORE (origin/main) ---
git checkout origin/main
# Pull in only the new test class to make the regression visible against
# the unmodified BaseFile in main:
git fetch https://github.com/jbbqqf/iceberg.git
feat/15622-cache-split-offsets-list
git checkout FETCH_HEAD --
core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java
./gradlew :iceberg-core:test --tests
"org.apache.iceberg.TestBaseFileSplitOffsets"
# Expected: 4 tests, 2 failures —
# splitOffsetsReturnsSameInstanceOnRepeatedCalls (allocates fresh List per
call)
# copyConstructorProducesIndependentCachedView (same)
# The other two pass on main because they don't depend on caching.
# --- AFTER (this PR) ---
git checkout FETCH_HEAD
./gradlew :iceberg-core:test --tests
"org.apache.iceberg.TestBaseFileSplitOffsets"
# Expected: 4 tests, 4 passes — splitOffsets() reuses a cached
# unmodifiable view, invalidated at every write site.
```
The only thing that changes between BEFORE and AFTER is the checked-out ref
of `BaseFile.java`; the test file is the same in both runs.
## What I ran locally
- `./gradlew :iceberg-core:test --tests
"org.apache.iceberg.TestBaseFileSplitOffsets"` on `origin/main` with the new
test imported → **2/4 fail** (regression visible).
- Same on this branch → **4/4 pass**.
- `./gradlew :iceberg-core:test --tests
"org.apache.iceberg.TestBaseFileSplitOffsets" --tests
"org.apache.iceberg.TestContentFileParser" --tests
"org.apache.iceberg.TestManifestReader" --tests
"org.apache.iceberg.TestTrackedFileStruct"` → **all green**.
- `./gradlew :iceberg-core:spotlessCheck` → **PASS**.
## Edge cases tested
| # | Scenario | Input | Expected | Verified by |
|---|----------|-------|----------|-------------|
| 1 | Repeated `splitOffsets()` calls on a single file | well-formed offsets
`[0, 1024, 2048]` | both calls return the **same** `List<Long>` instance, with
the same contents | `splitOffsetsReturnsSameInstanceOnRepeatedCalls` |
| 2 | Corrupted offsets (last offset >= file size) | `fileSize=100`, offsets
`[0, 50, 200]` | both calls return `null` (caching must not turn null into a
stale empty list) | `splitOffsetsNullWhenCorruptedThenReturnsNullStably` |
| 3 | Copy constructor invalidates the cache | copy a file that already has
`splitOffsets()` called | copy returns its own list with equal contents but
different identity, **and** copy's own cache is stable across repeated calls |
`copyConstructorProducesIndependentCachedView` |
| 4 | The returned list remains unmodifiable | `splitOffsets().set(0, 99L)`
| throws `UnsupportedOperationException` | `splitOffsetsListIsUnmodifiable` |
| 5 | Recursive comparison on `BaseFile` (existing test) | manifest
read+filter, equality vs constructed file | passes regardless of whether
`toString()` populated the cache on one side (cache field excluded from
comparison) | `TestManifestReader#testReaderWithFilterWithoutSelect` |
## Risk / blast radius
- `BaseFile` is the shared superclass for `DataFile`/`DeleteFile`. The cache
is internal, transient, and only read inside `splitOffsets()`.
- Wire/serialization: unaffected — `transient` means the field is excluded
from Avro reflection and Java serialization. Existing `equals`/`hashCode`
(delegated to `SupportsIndexProjection` / `internalGet`) are also unaffected
because they go through the indexed schema, not arbitrary fields.
- Memory: one extra reference field per `BaseFile` (≈ 4 bytes on
compressed-oop heaps), plus the cached `List<Long>` itself when populated
(which is the same wrapper that today is allocated and discarded on every call
— net memory pressure decreases for the common case).
- Concurrency: `BaseFile` instances are not documented as thread-safe across
mutation, and the cached field is a benign-data-race scenario (idempotent
population of an immutable view); no extra synchronization is introduced.
Additive change. No public API surface modified.
## Release note
```release-note
BaseFile.splitOffsets() now caches its unmodifiable List<Long> view,
eliminating per-call allocations during manifest rewrites and format
conversions.
```
---
*PR drafted with assistance from Claude Code. The change was reviewed
manually against `apache/iceberg`'s source on `main`. The reproducer block
above was used during development; it is the same one a reviewer can paste
verbatim.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]