LsomeYeah commented on PR #7621:
URL: https://github.com/apache/paimon/pull/7621#issuecomment-4517251218
Thanks for the detailed fix! The heap-dump-based root cause breakdown is
very clear. The HeapBytesVector overflow fix and Parquet config pass-through
both look great.
We have a few concerns about the RowHelper / BinaryRowSerializer parts
though — would love to hear your thoughts:
1. 4MB threshold lacks a basis. Data patterns vary a lot across workloads —
a table with uniform 1MB records, one with mixed 1KB/100MB records, and one
with sustained 5MB records all behave very differently around a fixed 4MB
cutoff.
2. Reuse mechanism effectively defeated near the threshold. For sustained
5–10MB records, every call triggers release-and-rebuild or
shrink-and-reallocate, paying allocation cost on every record. This defeats the
purpose of these reuse buffers.
3. Hot-path behavior change with no opt-out. The finally { resetIfTooLarge() }
block and the new shrink branch change the default behavior for every user on
upgrade, not only those experiencing OOM.
4. Prior art. Worth checking how Spark's UnsafeExternalSorter, or the
equivalent component in similar engines, handles reuse-buffer growth. There may
be a more principled heuristic (hysteresis, memory-pressure-based release) that
avoids the thrash pattern.
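
To make the hysteresis idea in point 4 concrete, here is a rough sketch of what
we have in mind. It is not taken from Paimon or Spark: the class name
HysteresisReuseBuffer, the quietCalls parameter, and the method names are all
hypothetical, and the real RowHelper / BinaryRowSerializer code paths may need a
different shape. The only point is that the buffer shrinks after a sustained run
of small records rather than on every call, so sustained 5MB records stop paying
allocation cost per record.

```java
/** Hypothetical sketch of a reuse buffer with hysteresis; not Paimon's actual code. */
final class HysteresisReuseBuffer {
    private final int shrinkThreshold; // capacity above which shrinking is considered
    private final int quietCalls;      // consecutive small records required before shrinking
    private byte[] buffer;
    private int smallStreak;           // current run of records at or below the threshold
    private int allocations;           // counts every array allocation, for illustration

    HysteresisReuseBuffer(int initialCapacity, int shrinkThreshold, int quietCalls) {
        this.shrinkThreshold = shrinkThreshold;
        this.quietCalls = quietCalls;
        this.buffer = new byte[initialCapacity];
        this.allocations = 1;
    }

    /** Returns a buffer of at least size bytes, growing only when it is too small. */
    byte[] acquire(int size) {
        if (size > buffer.length) {
            buffer = new byte[size];
            allocations++;
        }
        // A large record resets the streak; a small one extends it.
        smallStreak = size <= shrinkThreshold ? smallStreak + 1 : 0;
        return buffer;
    }

    /** Called after each record; shrinks only after quietCalls small records in a row. */
    void release() {
        if (buffer.length > shrinkThreshold && smallStreak >= quietCalls) {
            buffer = new byte[shrinkThreshold];
            allocations++;
            smallStreak = 0;
        }
    }

    int capacity() { return buffer.length; }

    int allocations() { return allocations; }
}
```

With a 4MB threshold and quietCalls = 3, ten consecutive 5MB records cost one
growth allocation in total instead of one per record, and the buffer only drops
back to 4MB once three small records have been seen in a row. A
memory-pressure-based release would be a different mechanism and is not shown
here.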