yugan95 opened a new issue, #7620: URL: https://github.com/apache/paimon/issues/7620
### Search before asking - [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar. ### Paimon version master (also affects 1.x releases) ### Compute Engine Spark 3.4 (engine-independent, core module issue) ### Minimal reproduce step Write a primary-key table where some rows contain very large binary/string columns (e.g. 100MB+ per record). The table uses the default LSM-Tree storage with external sort and compaction enabled. ``` sql CREATE TABLE T ( id INT NOT NULL, payload BYTES ) TBLPROPERTIES ( 'primary-key' = 'id', 'bucket' = '1' ); -- Insert rows where `payload` is ~100MB each INSERT INTO T VALUES (1, <100MB_bytes>), (2, <100MB_bytes>), ...; ``` After a few flush/compaction cycles, the TaskManager / Executor runs out of memory with OOM errors. ### What doesn't meet your expectations? Heap dump analysis reveals **four independent memory leak / overflow issues** when handling large records: **1. Sort path — `RowHelper` internal buffer never shrinks** `RowHelper.reuseWriter` grows its internal `MemorySegment` list to accommodate large records (e.g. 100MB+), but `BinaryRowWriter.reset()` only resets the cursor without releasing the oversized segments. Since `InternalRowSerializer.serialize()` can exit via `EOFException` (a normal signal when the sort buffer is full), the bloated buffer is never released. **2. Merge path — `BinaryRowSerializer.deserialize(reuse)` only grows, never shrinks** During external merge sort, each merge channel holds a `BinaryRow` reuse instance. When a large record is deserialized, the backing `MemorySegment` grows to fit it. Subsequent small records reuse the oversized buffer. With `max-num-file-handles` (default 128) merge channels, each retaining a 100MB+ buffer, memory usage explodes. **3. Compaction read path — `HeapBytesVector.reserveBytes()` integer overflow** `reserveBytes()` computes `newCapacity * 2` using plain multiplication. When `newCapacity` exceeds ~1.07 billion bytes, this overflows `Integer.MAX_VALUE`, producing a negative or zero value, which causes `Arrays.copyOf()` to throw `NegativeArraySizeException` or silently corrupt data. **4. Parquet write — statistics and page-size-check config not passed through** `RowDataParquetBuilder` does not pass through several important Parquet configuration properties: - **`parquet.statistics.truncate.length`** — controls truncation of min/max statistics. Defaults to `Integer.MAX_VALUE`, causing full 100MB+ values to be stored in column chunk metadata, ballooning the Parquet footer. - **`parquet.columnindex.truncate.length`** — same issue for column index entries. - **`parquet.page.size.row.check.min`** — minimum row count before checking page size. The default (100) means Parquet accumulates 100 rows before the first page-size check. For large records, this can cause a single page to balloon to several GB before being flushed. - **`parquet.page.size.row.check.max`** — maximum row count between page-size checks. The default (10000) is far too high for large-record workloads, delaying page flushes and causing excessive memory usage. Without these configs being passed through, users have no way to tune Parquet's behavior for large-record scenarios. ### Anything else? **Root cause analysis via heap dump:** | Path | Class | Issue | |---|---|---| | Sort | `RowHelper` | `reuseWriter` segments grow but never shrink; `EOFException` exit path skips cleanup | | Merge | `BinaryRowSerializer` | `deserialize(reuse)` only grows backing `MemorySegment`, never shrinks | | Compaction | `HeapBytesVector` | `newCapacity * 2` overflows `Integer.MAX_VALUE` | | Parquet write | `RowDataParquetBuilder` | Statistics/column-index truncate length and page-size-check row counts not configurable | ### Are you willing to submit a PR? - [x] I'm willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
