Re: [PR] [common][format] Fix OOM when writing/compacting table with large records [paimon]

via GitHub Fri, 22 May 2026 02:04:04 -0700


LsomeYeah commented on PR #7621:
URL: https://github.com/apache/paimon/pull/7621#issuecomment-4517251218


   Thanks for the detailed fix! The heap-dump-based root cause breakdown is 
very clear. The HeapBytesVector overflow fix and Parquet config pass-through 
both look great.
   
   We have a few concerns about the RowHelper / BinaryRowSerializer parts 
though — would love to hear your thoughts:
   
   1. 4MB threshold lacks a basis. Data patterns vary a lot across workloads — 
a table with uniform 1MB records, one with mixed 1KB/100MB records, and one 
with sustained 5MB records all behave very differently around a fixed 4MB 
cutoff. 
   2. Reuse mechanism effectively defeated just above the threshold. For 
sustained 5–10MB records, every call triggers release-and-rebuild or 
shrink-and-reallocate, so each record pays a fresh allocation. This defeats the 
purpose of these reuse buffers.
   3. Hot-path behavior change with no opt-out. The finally { resetIfTooLarge() 
} and the new shrink branch change default behavior for every user on upgrade, 
not only those experiencing OOM.
   4. Prior art. Worth checking how Spark's UnsafeExternalSorter or similar 
engines handle reuse-buffer growth — there may be a more principled heuristic 
(hysteresis, memory-pressure-based release) that avoids the thrash pattern.
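
   To make point 4 concrete, here is a minimal sketch of what a hysteresis-based
release could look like. This is not code from the PR or from Spark; the class
name `HysteresisBuffer`, the `patience` parameter, and the method names are all
hypothetical. The idea is that an oversized buffer is released only after
several consecutive small records, so sustained 5–10MB records keep reusing it.

```java
/**
 * Hypothetical reuse buffer with hysteresis. Illustration only: this is not
 * Paimon's RowHelper or BinaryRowSerializer, and all names are assumptions.
 */
final class HysteresisBuffer {
    private final int initialCapacity;
    private final int shrinkThreshold; // capacity above which the buffer counts as oversized
    private final int patience;        // consecutive small records required before shrinking
    private byte[] buffer;
    private int smallStreak = 0;
    private int allocations = 1;       // counts the initial allocation as well

    HysteresisBuffer(int initialCapacity, int shrinkThreshold, int patience) {
        if (initialCapacity <= 0 || shrinkThreshold < initialCapacity || patience < 1) {
            throw new IllegalArgumentException("invalid buffer configuration");
        }
        this.initialCapacity = initialCapacity;
        this.shrinkThreshold = shrinkThreshold;
        this.patience = patience;
        this.buffer = new byte[initialCapacity];
    }

    /** Returns a buffer of at least {@code size} bytes, growing it if needed. */
    byte[] acquire(int size) {
        if (size > buffer.length) {
            // Grow to at least double the current capacity to amortize growth.
            long doubled = 2L * buffer.length;
            int newCapacity = (int) Math.min(Integer.MAX_VALUE - 8, Math.max((long) size, doubled));
            buffer = new byte[newCapacity];
            allocations++;
        }
        return buffer;
    }

    /** Call once per record; shrinks only after a run of small records. */
    void release(int usedSize) {
        boolean oversized = buffer.length > shrinkThreshold;
        if (!oversized || usedSize > shrinkThreshold) {
            // Either nothing to reclaim, or large records are still arriving.
            smallStreak = 0;
            return;
        }
        smallStreak++;
        if (smallStreak >= patience) {
            buffer = new byte[initialCapacity];
            allocations++;
            smallStreak = 0;
        }
    }

    int capacity() { return buffer.length; }

    int allocations() { return allocations; }
}
```

   With this shape, a stream of records that all exceed the threshold allocates
once and then reuses the buffer, while a long run of small records after a
single outlier still lets the large buffer go. The value of `patience` is
itself a tunable, so it would need the same justification as the 4MB cutoff.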


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
