2010YOUY01 commented on PR #17029: URL: https://github.com/apache/datafusion/pull/17029#issuecomment-3167406930
> > How much is the difference for the fuzz tests I added that check memory constrained envs? as it only tests couple of simple columns that are easier to reason about. > > I also did some additional debugging to understand why SortPreservingMergeStream ends up using more memory than the pre-reserved amount. The root cause I identified is as follows: > > Except for slight discrepancies due to vector allocation overhead, I found several key sources of memory underestimation: > > 1. In-memory stream not accounted for SPM: > > When performing SPM (SortPreservingMerge) over both spill files and in-memory streams, we only reserve memory using `get_reserved_byte_for_record_batch_size(spill.max_record_batch_memory * buffer_len) * (per spill)`. However, if there are 2 spill streams and 1 in-memory stream, the reservation for the in-memory stream is not considered at all. > > 2. Incorrect logic in `get_reserved_byte_for_record_batch_size`: > > The estimation was based on a fixed 2× multiplier, without considering difference in sort key & sort payload columns and data type. In reality, this varies significantly and the current logic often under(/over)estimates. This is a known issue(#14748) and I’m actively working on it. > > 3. SPM buffers at most 2 (buffer_len * (batch and cursor)) per stream: > > Looking at the implementation, SPM can buffer both the previous and current (cursor, batch) for each stream simultaneously. See > > https://github.com/apache/datafusion/blob/f9efba0e2c4c9c272062b5d48f6a44830b8f08ab/datafusion/physical-plan/src/sorts/builder.rs#L64 > > and > https://github.com/apache/datafusion/blob/f9efba0e2c4c9c272062b5d48f6a44830b8f08ab/datafusion/physical-plan/src/sorts/merge.rs#L333 > > That means, in the worst case, SPM can use up to **2 ×** `get_reserved_byte_for_record_batch_size(spill.max_record_batch_memory * buffer_len) * (per stream)` So I think we should double the reservation per stream to be safe. For point 1: I vaguely remember in multi level merge, there is a logic to re-spill in-memory batches before the final merge, so that we don't have to special handlings for the mixed in-mem + spills case 🤔 If I’m not remembering it correctly, or we have missed some edges cases, we should do it (before the final merge, spill all in-mems first) for simplicity now. For point 2: I was expecting this should better be done after https://github.com/apache/datafusion/pull/15380, but it seems this optimization got stuck, I'll look into this issue in the next few days. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
