yugan95 commented on PR #7621:
URL: https://github.com/apache/paimon/pull/7621#issuecomment-4533293854
@JingsongLi Thanks for the detailed per-fix comments! Addressing each point:
**1. RowHelper — `getSegments().size()` semantics**
Good question. `AbstractBinaryWriter.getSegments()` returns a single
`MemorySegment`
(not a list/array — the method name is slightly misleading), and
`MemorySegment.size()`
returns the byte size of the underlying buffer. So the comparison
`reuseWriter.getSegments().size() > 4MB` correctly checks byte count, and
the threshold
matches.
I've also added a hysteresis guard per @LsomeYeah's feedback:
`resetIfTooLarge()` now
additionally checks `reuseRow.getSizeInBytes() < threshold`, so it only
releases when
the last written record was small. This avoids thrashing for sustained
large-record
workloads while still reclaiming memory when the workload transitions back.
**Configurability:** I think a fixed 4MB is sufficient for the initial
fix. If operators
with unusual memory profiles need tuning, we can add a system property in
a follow-up.
Keeping the initial change minimal reduces merge risk.
**2. BinaryRowSerializer shrink** — now improved with hysteresis. The
`length <
threshold` guard means consistent large-record workloads reuse the buffer
without
reallocation, addressing the "allocating every time" concern you noted.
**3. HeapBytesVector** — addressed in commit f1954ffd.
**4. Parquet config pass-through** — moved to #7956.
**5. Performance overhead** — the check is 1 null check + 2 int
comparisons, all on hot
fields already in cache. For small-record workloads (buffer < 4MB),
`resetIfTooLarge()`
short-circuits at the first size check — no branch taken, no allocation.
This is
comparable to an array bounds check and will be inlined by the JIT. Happy
to run a JMH
benchmark if the team wants hard numbers, but I don't expect a measurable
delta.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]