Re: [PR] Validate the memory consumption in SPM created by multi level merge [datafusion]

via GitHub Fri, 08 Aug 2025 03:35:16 -0700


2010YOUY01 commented on PR #17029:
URL: https://github.com/apache/datafusion/pull/17029#issuecomment-3167406930


   > > How much is the difference for the fuzz tests I added that check memory 
constrained envs? as it only tests couple of simple columns that are easier to 
reason about.
   > 
   > I also did some additional debugging to understand why 
SortPreservingMergeStream ends up using more memory than the pre-reserved 
amount. The root cause I identified is as follows:
   > 
   > Except for slight discrepancies due to vector allocation overhead, I found 
several key sources of memory underestimation:
   > 
   > 1. In-memory stream not accounted for SPM:
   > 
   > When performing SPM (SortPreservingMerge) over both spill files and 
in-memory streams, we only reserve memory using 
`get_reserved_byte_for_record_batch_size(spill.max_record_batch_memory * 
buffer_len) * (per spill)`. However, if there are 2 spill streams and 1 
in-memory stream, the reservation for the in-memory stream is not considered at 
all.
   > 
   > 2. Incorrect logic in `get_reserved_byte_for_record_batch_size`:
   > 
   > The estimation was based on a fixed 2× multiplier, without considering 
difference in sort key & sort payload columns and data type. In reality, this 
varies significantly and the current logic often under(/over)estimates. This is 
a known issue(#14748) and I’m actively working on it.
   > 
   > 3. SPM buffers at most 2 (buffer_len * (batch and cursor)) per stream:
   > 
   > Looking at the implementation, SPM can buffer both the previous and 
current (cursor, batch) for each stream simultaneously. See
   > 
   > 
https://github.com/apache/datafusion/blob/f9efba0e2c4c9c272062b5d48f6a44830b8f08ab/datafusion/physical-plan/src/sorts/builder.rs#L64
   > 
   > and
   > 
https://github.com/apache/datafusion/blob/f9efba0e2c4c9c272062b5d48f6a44830b8f08ab/datafusion/physical-plan/src/sorts/merge.rs#L333
   > 
   > That means, in the worst case, SPM can use up to **2 ×** 
`get_reserved_byte_for_record_batch_size(spill.max_record_batch_memory * 
buffer_len) * (per stream)` So I think we should double the reservation per 
stream to be safe.
   
   For point 1: I vaguely remember in multi level merge, there is a logic to 
re-spill in-memory batches before the final merge, so that we don't have to 
special handlings for the mixed in-mem + spills case 🤔 If I’m not remembering 
it correctly, or we have missed some edges cases, we should do it (before the 
final merge, spill all in-mems first) for simplicity now.
   
   For point 2: I was expecting this should better be done after 
https://github.com/apache/datafusion/pull/15380, but it seems this optimization 
got stuck, I'll look into this issue in the next few days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Validate the memory consumption in SPM created by multi level merge [datafusion]

Reply via email to