jbbqqf opened a new pull request, #16265:
URL: https://github.com/apache/iceberg/pull/16265

   ## Summary
   
   `BaseFile.splitOffsets()` previously wrapped the underlying `long[]` in a 
fresh `ArrayUtil.toUnmodifiableLongList` on every invocation. Manifest 
rewriting, format conversion, and `toString()` debugging can call 
`splitOffsets()` many times for the same file, leaking a `List<Long>` wrapper 
each call.
   
   This PR caches the unmodifiable view in a transient field, populated lazily 
on the first call and invalidated wherever the underlying `long[]` is 
reassigned.
   
   Resolves #15622
   
   ## Context
   
   The original ask in #15622 — *"BaseFile.splitOffsets() allocates a new List 
on every call"* — was previously addressed in #15625 / #15624, both of which 
were stale-closed at 30 days with no review (no technical objections). This PR 
is a fresh, minimal revival of that approach with explicit invalidation at 
every write site, regression tests covering the cache contract, and AI 
disclosure.
   
   The trade-off is small: one extra reference field per `BaseFile` (a few 
bytes), zero serialization impact (transient), and a one-line guard in 
`splitOffsets()` itself. Hot paths that call `splitOffsets()` repeatedly stop 
paying the per-call allocation cost.
   
   ## Changes
   
   - `core/src/main/java/org/apache/iceberg/BaseFile.java`
     - Add a transient `List<Long> splitOffsetsList` cache field.
     - `splitOffsets()` populates the cache on first call and reuses it 
thereafter; corrupted offsets still return `null` stably.
     - All three sites that reassign the `long[] splitOffsets` field (the 
long-form constructor, the copy constructor, and the indexed setter `case 14`) 
now also null out the cache so the next `splitOffsets()` call rebuilds it from 
the new array.
     - Inline comment on the cache field explains the why and points to the 
issue, so a reviewer reading the diff cold doesn't have to re-derive it.
   - `core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java` 
*(new)* — four tests covering: repeated-call instance reuse; 
null-when-corrupted contract is stable across calls; copy-constructor produces 
an independent cached view; the returned list remains unmodifiable.
   - `core/src/test/java/org/apache/iceberg/TestManifestReader.java` — extends 
`FILE_COMPARISON_CONFIG` to ignore the new transient cache field. Recursive 
equality on data files should not depend on whether `splitOffsets()` / 
`toString()` happened to be called on one side of the comparison before the 
assertion. (This is the same pattern the config already uses for 
`manifestLocation`, `fileOrdinal`, `fromProjectionPos`, etc.)
   
   ## Reproduce BEFORE/AFTER yourself (copy-paste)
   
   ```bash
   # --- one-time setup ---
   git clone https://github.com/apache/iceberg.git /tmp/repro && cd /tmp/repro
   
   # --- BEFORE (origin/main) ---
   git checkout origin/main
   # Pull in only the new test class to make the regression visible against
   # the unmodified BaseFile in main:
   git fetch https://github.com/jbbqqf/iceberg.git 
feat/15622-cache-split-offsets-list
   git checkout FETCH_HEAD -- 
core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java
   ./gradlew :iceberg-core:test --tests 
"org.apache.iceberg.TestBaseFileSplitOffsets"
   # Expected: 4 tests, 2 failures —
   #   splitOffsetsReturnsSameInstanceOnRepeatedCalls (allocates fresh List per 
call)
   #   copyConstructorProducesIndependentCachedView   (same)
   # The other two pass on main because they don't depend on caching.
   
   # --- AFTER (this PR) ---
   git checkout FETCH_HEAD
   ./gradlew :iceberg-core:test --tests 
"org.apache.iceberg.TestBaseFileSplitOffsets"
   # Expected: 4 tests, 4 passes — splitOffsets() reuses a cached
   # unmodifiable view, invalidated at every write site.
   ```
   
   The only thing that changes between BEFORE and AFTER is the checked-out ref 
of `BaseFile.java`; the test file is the same in both runs.
   
   ## What I ran locally
   
   - `./gradlew :iceberg-core:test --tests 
"org.apache.iceberg.TestBaseFileSplitOffsets"` on `origin/main` with the new 
test imported → **2/4 fail** (regression visible).
   - Same on this branch → **4/4 pass**.
   - `./gradlew :iceberg-core:test --tests 
"org.apache.iceberg.TestBaseFileSplitOffsets" --tests 
"org.apache.iceberg.TestContentFileParser" --tests 
"org.apache.iceberg.TestManifestReader" --tests 
"org.apache.iceberg.TestTrackedFileStruct"` → **all green**.
   - `./gradlew :iceberg-core:spotlessCheck` → **PASS**.
   
   ## Edge cases tested
   
   | # | Scenario | Input | Expected | Verified by |
   |---|----------|-------|----------|-------------|
   | 1 | Repeated `splitOffsets()` calls on a single file | well-formed offsets 
`[0, 1024, 2048]` | both calls return the **same** `List<Long>` instance, with 
the same contents | `splitOffsetsReturnsSameInstanceOnRepeatedCalls` |
   | 2 | Corrupted offsets (last offset >= file size) | `fileSize=100`, offsets 
`[0, 50, 200]` | both calls return `null` (caching must not turn null into a 
stale empty list) | `splitOffsetsNullWhenCorruptedThenReturnsNullStably` |
   | 3 | Copy constructor invalidates the cache | copy a file that already has 
`splitOffsets()` called | copy returns its own list with equal contents but 
different identity, **and** copy's own cache is stable across repeated calls | 
`copyConstructorProducesIndependentCachedView` |
   | 4 | The returned list remains unmodifiable | `splitOffsets().set(0, 99L)` 
| throws `UnsupportedOperationException` | `splitOffsetsListIsUnmodifiable` |
   | 5 | Recursive comparison on `BaseFile` (existing test) | manifest 
read+filter, equality vs constructed file | passes regardless of whether 
`toString()` populated the cache on one side (cache field excluded from 
comparison) | `TestManifestReader#testReaderWithFilterWithoutSelect` |
   
   ## Risk / blast radius
   
   - `BaseFile` is the shared superclass for `DataFile`/`DeleteFile`. The cache 
is internal, transient, and only read inside `splitOffsets()`.
   - Wire/serialization: unaffected — `transient` means the field is excluded 
from Avro reflection and Java serialization. Existing `equals`/`hashCode` 
(delegated to `SupportsIndexProjection` / `internalGet`) are also unaffected 
because they go through the indexed schema, not arbitrary fields.
   - Memory: one extra reference field per `BaseFile` (≈ 4 bytes on 
compressed-oop heaps), plus the cached `List<Long>` itself when populated 
(which is the same wrapper that today is allocated and discarded on every call 
— net memory pressure decreases for the common case).
   - Concurrency: `BaseFile` instances are not documented as thread-safe across 
mutation, and the cached field is a benign-data-race scenario (idempotent 
population of an immutable view); no extra synchronization is introduced.
   
   Additive change. No public API surface modified.
   
   ## Release note
   
   ```release-note
   BaseFile.splitOffsets() now caches its unmodifiable List<Long> view, 
eliminating per-call allocations during manifest rewrites and format 
conversions.
   ```
   
   ---
   
   *PR drafted with assistance from Claude Code. The change was reviewed 
manually against `apache/iceberg`'s source on `main`. The reproducer block 
above was used during development; it is the same one a reviewer can paste 
verbatim.*
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to