yugan95 opened a new issue, #7620:
URL: https://github.com/apache/paimon/issues/7620

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Paimon version
   
   master (also affects 1.x releases)
   
   ### Compute Engine
   
   Spark 3.4 (engine-independent, core module issue)
   
   ### Minimal reproduce step
   
   Write a primary-key table where some rows contain very large binary/string 
columns (e.g. 100MB+ per record). The table uses the default LSM-Tree storage 
with external sort and compaction enabled.
   
   ``` sql
   CREATE TABLE T (
       id INT NOT NULL,
       payload BINARY
   ) TBLPROPERTIES (
       'primary-key' = 'id',
       'bucket' = '1'
   );
   
   -- Insert rows where `payload` is ~100MB each
   INSERT INTO T VALUES (1, <100MB_bytes>), (2, <100MB_bytes>), ...;
   ```
   After a few flush/compaction cycles, the TaskManager / Executor fails with 
`OutOfMemoryError`.
   
   ### What doesn't meet your expectations?
   
   Heap dump analysis reveals **four independent memory leak / overflow 
issues** when handling large records:
   
   **1. Sort path — `RowHelper` internal buffer never shrinks**
   
   `RowHelper.reuseWriter` grows its internal `MemorySegment` list to 
accommodate large records (e.g. 100MB+), but `BinaryRowWriter.reset()` only 
resets the cursor without releasing the oversized segments. Since 
`InternalRowSerializer.serialize()` can exit via `EOFException` (a normal 
signal when the sort buffer is full), the bloated buffer is never released.
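
One possible fix direction, sketched under assumptions rather than taken from Paimon's code: release the oversized reuse buffer in a `finally` block, so the cleanup also runs when serialization exits via `EOFException`. The class `ReusableRecordWriter`, its two thresholds, and the `sinkRemaining` parameter are hypothetical stand-ins for `RowHelper`, `reuseWriter`, and the sort buffer.

```java
import java.io.EOFException;

/** Hypothetical stand-in for a reusable row writer; not a Paimon class. */
final class ReusableRecordWriter {
    static final int INITIAL_CAPACITY = 64;        // assumed starting size
    static final int MAX_RETAINED_CAPACITY = 1024; // assumed shrink threshold

    private byte[] buffer = new byte[INITIAL_CAPACITY];

    int capacity() {
        return buffer.length;
    }

    /**
     * Stages the record in the reuse buffer, then "writes" it to a sink that
     * has sinkRemaining free bytes. Throws EOFException when the sink is full,
     * mirroring the normal sort-buffer-full signal.
     */
    void write(byte[] record, int sinkRemaining) throws EOFException {
        try {
            if (record.length > buffer.length) {
                buffer = new byte[record.length]; // grow to fit a large record
            }
            System.arraycopy(record, 0, buffer, 0, record.length);
            if (record.length > sinkRemaining) {
                throw new EOFException("sink is full");
            }
        } finally {
            // Runs on both the normal and the EOFException exit path.
            if (buffer.length > MAX_RETAINED_CAPACITY) {
                buffer = new byte[INITIAL_CAPACITY];
            }
        }
    }
}
```

With this shape, a buffer that grew for one large record is dropped before the next call, whether or not the write succeeded.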
   
   **2. Merge path — `BinaryRowSerializer.deserialize(reuse)` only grows, never 
shrinks**
   
   During external merge sort, each merge channel holds a `BinaryRow` reuse 
instance. When a large record is deserialized, the backing `MemorySegment` 
grows to fit it. Subsequent small records reuse the oversized buffer. With 
`max-num-file-handles` (default 128) merge channels, each retaining a 100MB+ 
buffer, memory usage explodes.
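
To put a number on it: 128 channels each retaining a 100MB buffer is about 12.5 GiB of heap. The sketch below shows that arithmetic alongside one possible reuse policy, which drops an oversized buffer as soon as a small record arrives. `ChannelReuseBuffer` and `MAX_RETAINED_BYTES` are hypothetical names and values, not Paimon's `BinaryRow` or `BinaryRowSerializer`.

```java
/** Hypothetical per-channel reuse buffer; not Paimon's BinaryRow. */
final class ChannelReuseBuffer {
    static final int MAX_RETAINED_BYTES = 1 << 20; // assumed 1 MiB retention cap

    private byte[] bytes = new byte[0];

    int capacity() {
        return bytes.length;
    }

    /** Returns a buffer that fits length bytes, dropping oversized leftovers. */
    byte[] ensure(int length) {
        boolean tooSmall = bytes.length < length;
        boolean oversized =
                bytes.length > MAX_RETAINED_BYTES && length <= MAX_RETAINED_BYTES;
        if (tooSmall || oversized) {
            bytes = new byte[length]; // the old array becomes garbage-collectable
        }
        return bytes;
    }

    /** Worst-case bytes retained across all channels if buffers never shrink. */
    static long worstCaseRetainedBytes(int channels, long largestRecordBytes) {
        return channels * largestRecordBytes;
    }
}
```

A real implementation would probably grow geometrically for small records instead of sizing exactly; the point here is only the shrink condition.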
   
   **3. Compaction read path — `HeapBytesVector.reserveBytes()` integer 
overflow**
   
   `reserveBytes()` computes `newCapacity * 2` using plain `int` multiplication. 
When `newCapacity` reaches 2^30 (~1.07 billion bytes), the product overflows 
`Integer.MAX_VALUE` and wraps to a negative value, which causes 
`Arrays.copyOf()` to throw `NegativeArraySizeException` or can silently corrupt 
data.
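
The overflow and one common way to avoid it can be sketched as follows: do the doubling in `long` arithmetic and clamp the result to a safe array size. `CapacityGrowth`, both method names, and the `Integer.MAX_VALUE - 8` limit are assumptions for illustration, not the actual `HeapBytesVector` code.

```java
/** Hypothetical helper; not the actual HeapBytesVector code. */
final class CapacityGrowth {
    // Conservative array size limit; some JVMs reserve a few header words.
    static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    /** The overflow-prone form: wraps negative once required >= 2^30. */
    static int naiveDoubled(int required) {
        return required * 2;
    }

    /** Doubles in long arithmetic, then clamps to the maximum array size. */
    static int safeDoubled(int required) {
        if (required < 0 || required > MAX_ARRAY_SIZE) {
            throw new IllegalArgumentException("Cannot reserve " + required + " bytes");
        }
        long doubled = 2L * required;
        return (int) Math.min(doubled, MAX_ARRAY_SIZE);
    }
}
```

For a request of 1,200,000,000 bytes, `naiveDoubled` returns a negative number, while `safeDoubled` returns the clamped maximum and still covers the request.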
   
   **4. Parquet write — statistics and page-size-check config not passed 
through**
   
   `RowDataParquetBuilder` does not pass through several important Parquet 
configuration properties:
   
   - **`parquet.statistics.truncate.length`**: controls truncation of min/max 
statistics. It defaults to `Integer.MAX_VALUE`, so full 100MB+ values are 
stored in column chunk metadata, ballooning the Parquet footer.
   - **`parquet.columnindex.truncate.length`**: the equivalent truncation limit 
for column index min/max entries, which likewise cannot be set.
   - **`parquet.page.size.row.check.min`**: minimum row count before the first 
page-size check. The default (100) means Parquet accumulates 100 rows before 
checking. With 100MB records, that is roughly 10GB buffered in a single page 
before it can be flushed.
   - **`parquet.page.size.row.check.max`**: maximum row count between 
page-size checks. The default (10000) is far too high for large-record 
workloads, delaying page flushes and causing excessive memory usage.
   
   Without these configs being passed through, users have no way to tune 
Parquet's behavior for large-record scenarios.
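
A minimal sketch of what pass-through could look like, assuming the four keys arrive as plain string table options: select and validate them so they can be forwarded to the Parquet writer. `ParquetLargeRecordOptions` and `select` are hypothetical and not part of Paimon or Parquet; how the values reach the writer inside `RowDataParquetBuilder` is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pass-through helper; not part of Paimon or Parquet. */
final class ParquetLargeRecordOptions {
    // The four Parquet property keys named in this issue.
    static final List<String> PASS_THROUGH_KEYS = List.of(
            "parquet.statistics.truncate.length",
            "parquet.columnindex.truncate.length",
            "parquet.page.size.row.check.min",
            "parquet.page.size.row.check.max");

    /** Copies only the recognized keys, validating each as a positive int. */
    static Map<String, Integer> select(Map<String, String> tableOptions) {
        Map<String, Integer> selected = new LinkedHashMap<>();
        for (String key : PASS_THROUGH_KEYS) {
            String raw = tableOptions.get(key);
            if (raw == null) {
                continue; // unset: leave Parquet's default in place
            }
            int value = Integer.parseInt(raw.trim());
            if (value <= 0) {
                throw new IllegalArgumentException(key + " must be positive: " + raw);
            }
            selected.put(key, value);
        }
        return selected;
    }
}
```

For a large-record table, a user might then set, for example, a statistics truncate length of 64 and a minimum row check of 1; those values are illustrative only.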
   
   ### Anything else?
   
   **Root cause analysis via heap dump:**
   
   | Path | Class | Issue |
   |---|---|---|
   | Sort | `RowHelper` | `reuseWriter` segments grow but never shrink; `EOFException` exit path skips cleanup |
   | Merge | `BinaryRowSerializer` | `deserialize(reuse)` only grows backing `MemorySegment`, never shrinks |
   | Compaction | `HeapBytesVector` | `newCapacity * 2` overflows `Integer.MAX_VALUE` |
   | Parquet write | `RowDataParquetBuilder` | Statistics/column-index truncate length and page-size-check row counts not configurable |
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!

