yugan95 opened a new pull request, #7621: URL: https://github.com/apache/paimon/pull/7621
### Purpose

Linked issue: close #7620

Fix OOM when writing a table with large records (100MB+) due to unbounded buffer growth in the sort, merge and compaction paths. Heap dump analysis identified four independent root causes:

**1. Sort path: `RowHelper` internal buffer never shrinks**

`RowHelper.reuseWriter` grows its internal `MemorySegment` list for large records, but `BinaryRowWriter.reset()` only resets the cursor without releasing oversized segments. Additionally, `InternalRowSerializer.serialize()` can exit via `EOFException` (a normal signal when the sort buffer is full), skipping any cleanup of the bloated buffer.

**2. Merge path: `BinaryRowSerializer.deserialize(reuse)` only grows, never shrinks**

Each merge channel holds a `BinaryRow` reuse instance. When a large record is deserialized, the backing `MemorySegment` grows to fit it but is never shrunk for subsequent small records. With `max-num-file-handles` (default 128) channels each retaining a 100MB+ buffer, memory usage explodes.

**3. Compaction read path: `HeapBytesVector.reserveBytes()` integer overflow**

`reserveBytes()` computes `newCapacity * 2` using plain `int` multiplication. When `newCapacity` exceeds ~1.07 billion bytes, the result overflows `Integer.MAX_VALUE`, causing `NegativeArraySizeException` or silent data corruption.

**4. Parquet write: statistics and page-size-check config not passed through**

`RowDataParquetBuilder` does not pass through `parquet.statistics.truncate.length`, `parquet.columnindex.truncate.length`, `parquet.page.size.row.check.min`, and `parquet.page.size.row.check.max`. Without these, users cannot tune Parquet behavior for large-record scenarios, leading to multi-GB pages and bloated footers.

#### Changes

1. **`RowHelper`**: add `resetIfTooLarge()`, which releases the internal buffer when segments exceed 4MB
2. **`InternalRowSerializer`**: call `resetIfTooLarge()` in the `finally` block of `serialize()` and `serializeToPages()` to handle the `EOFException` exit path
3. **`BinaryRowSerializer`**: add shrink logic in `deserialize(reuse)`, reallocating when the existing buffer is above the 4MB threshold
4. **`HeapBytesVector`**: use bit-shift (`<< 1`) instead of `* 2`, cap at `MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8`, throw a clear error on overflow
5. **`RowDataParquetBuilder`**: pass through `statistics.truncate.length`, `columnindex.truncate.length`, `min-row-count-for-page-size-check`, `max-row-count-for-page-size-check` from config

### Tests

- `RowHelperTest`: validates that `resetIfTooLarge()` releases oversized buffers (> 4MB) and preserves small ones
- `BinaryRowSerializerShrinkTest`: validates that `deserialize(reuse)` shrinks oversized buffers and preserves small ones
- `HeapBytesVectorReserveBytesTest`: validates overflow-safe `reserveBytes()` growth and data correctness

### API and Format

N/A: no public API or format changes.

### Documentation

N/A
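
For reviewers, the overflow-safe growth described for `HeapBytesVector` could look roughly like the sketch below. It is an illustration and not the code in this PR: the class `CapacityGrowthSketch` and the methods `nextCapacity` and `grow` are hypothetical names, and the sketch doubles in 64-bit arithmetic (an assumption on my part) before clamping to the `MAX_ARRAY_SIZE` value quoted above.

```java
import java.util.Arrays;

// Hypothetical illustration; not the actual HeapBytesVector implementation.
final class CapacityGrowthSketch {
    // Cap quoted in the PR description.
    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    // Returns a capacity of at least `required` bytes, doubling `current` when possible.
    static int nextCapacity(int current, long required) {
        if (required < 0 || required > MAX_ARRAY_SIZE) {
            throw new IllegalStateException(
                    "Required capacity " + required
                            + " exceeds the maximum array size " + MAX_ARRAY_SIZE);
        }
        // Widen to long before shifting so the doubled value cannot wrap to a negative int.
        long doubled = Math.max(((long) current) << 1, 1L);
        long target = Math.max(doubled, required);
        // Clamp to the cap; the cast is safe because target <= MAX_ARRAY_SIZE here.
        return (int) Math.min(target, (long) MAX_ARRAY_SIZE);
    }

    // Grows `buffer` only when it cannot hold `required` bytes; existing bytes are preserved.
    static byte[] grow(byte[] buffer, long required) {
        if (required <= buffer.length) {
            return buffer;
        }
        return Arrays.copyOf(buffer, nextCapacity(buffer.length, required));
    }
}
```

With a current capacity of `1 << 30`, a plain `int` doubling would wrap to a negative number, whereas this sketch returns `MAX_ARRAY_SIZE`. A request above the cap fails with an explicit message instead of a `NegativeArraySizeException`.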

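
The shrink-on-reuse idea behind the `RowHelper` and `BinaryRowSerializer` changes can be sketched in isolation as follows. This is a simplified model under stated assumptions: it uses a plain `byte[]` rather than Paimon's `MemorySegment`, the class `ReusableBufferSketch` and its methods are hypothetical, and the exact shrink condition (reallocate only when the buffer is above 4MB and the incoming record fits under that threshold) is my reading of the description, not a copy of the PR's logic.

```java
// Hypothetical illustration of a reusable buffer that grows on demand and shrinks after large records.
final class ReusableBufferSketch {
    // 4MB threshold quoted in the PR description.
    static final int SHRINK_THRESHOLD = 4 * 1024 * 1024;
    // Assumed minimum size, chosen only to avoid reallocating for tiny records.
    static final int INITIAL_SIZE = 64;

    private byte[] buffer = new byte[INITIAL_SIZE];

    // Returns a buffer that can hold `length` bytes, reallocating if it is too small or oversized.
    byte[] prepare(int length) {
        boolean tooSmall = buffer.length < length;
        boolean oversized = buffer.length > SHRINK_THRESHOLD && length <= SHRINK_THRESHOLD;
        if (tooSmall || oversized) {
            // Dropping the old array makes the large allocation eligible for garbage collection.
            buffer = new byte[Math.max(length, INITIAL_SIZE)];
        }
        return buffer;
    }

    int capacity() {
        return buffer.length;
    }
}
```

In this model, one 100MB record followed by small records leaves each channel holding a small buffer again, which is the behavior the merge-path fix aims for across up to `max-num-file-handles` channels.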