Kontinuation opened a new pull request, #14644:
URL: https://github.com/apache/datafusion/pull/14644

   ## Which issue does this PR close?
   
   - It is intended to close #10073, but it is still an incomplete solution.
   
   ## Rationale for this change
   
   
   I had a hard time making DataFusion Comet work on cloud instances with 4 GB of 
memory per CPU core, partly because DataFusion is very likely to allocate more 
memory than it reserves and run into OOM, or to fail with various kinds of memory 
reservation errors. When the partition being processed is larger than the 
available memory, we expect spilling to let the query run to completion, but we 
got tons of failures instead. We found that operators involving SortExec, such as 
sort-merge join, trigger these problems frequently.
   
   #10073 reports that SortExec may allocate up to 2X the memory it reserves ("the 
second problem"), and we found that this contributed to most of the OOM cases we 
encountered when using Comet. We have also found several other SortExec problems 
that are critical for our memory-limited use cases, and this PR tries to address 
some of them.
   
   ## What changes are included in this PR?
   
   This PR contains several fixes:
   
   1. Don't `try_collect` the result of merging all at once. Instead, consume the 
merged batches one after another and reserve memory for each batch. Once a 
reservation fails, switch to "spill mode" and write all subsequent batches into 
the spill file (see the first sketch after this list). This resolves the 2X 
memory allocation problem ("the second problem") reported by #10073.
   2. Call `shrink_to_fit` on every sorted batch to reduce its memory footprint; 
otherwise sorted string arrays may take up to 2X the original space in the worst 
case, due to the exponential growth of the `MutableBuffer` that stores 
variable-length binary values (see the second sketch after this list). 
`shrink_to_fit` is a no-op for the primitive-type columns returned by 
`take_arrays`, since they already have the right capacity, and benchmarking 
showed no significant performance regression, so I think it is a good change. 
This resolves "the first problem" reported by #10073.
   3. Reserve more memory for ingested batches to leave some room for merging. 
This PR reserves 2X the memory for each batch, which works for most of the 
queries in the sort-tpch benchmark (all except Q5 and Q8). Users still have to 
configure `sort_spill_reservation_bytes` when the reserved memory is not large 
enough for merging. I don't think this is an ideal change, but it is the only 
solution I can think of to compensate for the extra memory used by the row 
representation of the sorted columns.
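
   To make item 1 concrete, here is a minimal, self-contained sketch of the 
consume-and-reserve pattern. `Batch`, `Reservation`, and `SpillFile` are 
simplified stand-ins used only for illustration; they are not DataFusion's 
actual types or APIs.

```rust
/// In-memory size of one record batch, in bytes (illustrative stand-in).
struct Batch {
    size: usize,
}

/// Simplified memory reservation with a hard limit (illustrative stand-in).
struct Reservation {
    used: usize,
    limit: usize,
}

impl Reservation {
    /// Try to grow the reservation; fail when the limit would be exceeded.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            Err(format!("cannot reserve {bytes} additional bytes"))
        } else {
            self.used += bytes;
            Ok(())
        }
    }
}

/// Stand-in for a spill file on disk.
#[derive(Default)]
struct SpillFile {
    spilled: Vec<Batch>,
}

impl SpillFile {
    fn write(&mut self, batch: Batch) {
        self.spilled.push(batch);
    }
}

/// Consume merged batches one after another. Keep a batch in memory only while
/// the reservation can grow; once a reservation fails, switch to "spill mode"
/// and write that batch and all subsequent batches to the spill file.
fn drain_merged(
    merged: impl Iterator<Item = Batch>,
    reservation: &mut Reservation,
    spill: &mut SpillFile,
) -> Vec<Batch> {
    let mut in_memory = Vec::new();
    let mut spill_mode = false;

    for batch in merged {
        if !spill_mode && reservation.try_grow(batch.size).is_err() {
            spill_mode = true;
        }
        if spill_mode {
            spill.write(batch);
        } else {
            in_memory.push(batch);
        }
    }
    in_memory
}

fn main() {
    let merged = (0..8).map(|_| Batch { size: 1024 });
    let mut reservation = Reservation { used: 0, limit: 4096 };
    let mut spill = SpillFile::default();

    let kept = drain_merged(merged, &mut reservation, &mut spill);
    println!("kept {} in memory, spilled {}", kept.len(), spill.spilled.len());
}
```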
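
   Item 2 relies on the fact that growable buffers double their capacity as 
values are appended, so the buffer backing a sorted variable-length array can 
end up holding nearly 2X the bytes actually used. The plain-`Vec` illustration 
below shows the slack that `shrink_to_fit` reclaims; it intentionally avoids the 
Arrow API and only demonstrates the general effect.

```rust
fn main() {
    // Append variable-length values; the backing buffer grows exponentially,
    // much like a MutableBuffer accumulating string data during a sort.
    let mut buf: Vec<u8> = Vec::new();
    for i in 0..100_000u32 {
        buf.extend_from_slice(format!("row-{i}").as_bytes());
    }
    println!("len = {}, capacity before = {}", buf.len(), buf.capacity());

    // Analogous to calling shrink_to_fit on each sorted batch: the excess
    // capacity (up to ~2X in the worst case) is released back to the allocator.
    buf.shrink_to_fit();
    println!("len = {}, capacity after  = {}", buf.len(), buf.capacity());
}
```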
   
   ## Are these changes tested?
   
   Yes. It passes all the tests.
   
   ## Are there any user-facing changes?
   
   Users may find that the sort operator is more likely to spill when running 
under memory constraints. The configuration they previously used to make the 
sort operator work may no longer be optimal after applying this PR. For 
instance, a user may have configured a very large `sort_spill_reservation_bytes` 
to make merging work, but this PR reduces the optimal value of 
`sort_spill_reservation_bytes` for the same workload, as sketched below.
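
   A hedged sketch of adjusting that setting from Rust follows. It assumes the 
builder-style setter `with_sort_spill_reservation_bytes` on `SessionConfig`, 
which maps to the `datafusion.execution.sort_spill_reservation_bytes` option; 
check the configuration docs of your DataFusion version for the exact API.

```rust
use datafusion::prelude::*;

fn main() {
    // Assumed setter for datafusion.execution.sort_spill_reservation_bytes;
    // after this PR a smaller value may be sufficient for the same workload.
    let config = SessionConfig::new()
        .with_sort_spill_reservation_bytes(16 * 1024 * 1024); // 16 MiB
    let _ctx = SessionContext::new_with_config(config);
}
```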
   

