Kontinuation opened a new pull request, #14644: URL: https://github.com/apache/datafusion/pull/14644
## Which issue does this PR close?

- Hopefully it closes #10073, but it is still an incomplete solution.

## Rationale for this change

I had a hard time making DataFusion Comet work on cloud instances with 4 GB of memory per CPU core, partly because DataFusion is very likely to allocate more memory than it reserves and run into OOM, or to hit various kinds of memory reservation failures. When the partition being processed is larger than the available memory, we expect spilling to kick in so the query runs to completion, but instead we got many failures. We found that operators involving SortExec, such as sort-merge join, trigger the aforementioned problems frequently. #10073 reports that SortExec may allocate 2X the memory it reserves ("the second problem"), and we found that this contributed to most of the OOM cases we encountered when using Comet. We have also found several other SortExec-related problems that are critical for our memory-limited use cases, and this PR addresses some of them.

## What changes are included in this PR?

This PR contains several fixes:

1. Don't `try_collect` the result of merging all at once. Instead, consume the merged batches one after another and reserve memory for each batch. Once a reservation fails, we switch to "spill mode" and write all future batches into the spill file. This resolves the 2X memory allocation problem ("the second problem") reported by #10073.
2. `shrink_to_fit` every sorted batch to reduce its memory footprint; otherwise sorted string arrays may take 2X the original space in the worst case, due to the exponential growth of the `MutableBuffer` that stores variable-length binary values.
   `shrink_to_fit` is a no-op for the primitive-type columns returned by `take_arrays`, since they already have the right capacity, and benchmarking showed no significant performance regression, so I think it is a good change. This resolves "the first problem" reported by #10073.
3. Reserve more memory for ingested batches to leave some room for merging. This PR reserves 2X memory for each batch, which works for most of the queries in the sort-tpch benchmark (all except Q5 and Q8). Users still have to configure `sort_spill_reservation_bytes` when the reserved memory is not large enough for merging. I don't think this is a good change, but it is the only solution I can think of to compensate for the extra memory used by the row representation of the sorted columns.

## Are these changes tested?

Yes. It passes all the tests.

## Are there any user-facing changes?

Users may find that the sort operator is more likely to spill when running under memory constraints. The configurations they previously tuned to make the sort operator work may no longer be optimal after applying this PR. For instance, a user may have configured a very large `sort_spill_reservation_bytes` to make merging work, but this PR reduces the optimal value of `sort_spill_reservation_bytes` for the same workload.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
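The consume-and-reserve pattern in change 1 can be sketched in plain Rust. This is a toy model: `MemoryPool`, `merge_with_spill_fallback`, and byte-vector "batches" are illustrative stand-ins, not DataFusion's actual `MemoryReservation`/`RecordBatch` API.

```rust
// Toy illustration of change 1: instead of collecting the whole merged
// output and reserving memory for it at once, reserve per batch and
// switch to writing batches out once a reservation fails.

struct MemoryPool {
    limit: usize,
    used: usize,
}

impl MemoryPool {
    fn try_reserve(&mut self, bytes: usize) -> bool {
        if self.used + bytes <= self.limit {
            self.used += bytes;
            true
        } else {
            false
        }
    }
}

fn merge_with_spill_fallback(
    merged: impl Iterator<Item = Vec<u8>>, // stand-in for a stream of RecordBatches
    pool: &mut MemoryPool,
) -> (Vec<Vec<u8>>, Vec<Vec<u8>>) {
    let mut in_memory = Vec::new();
    let mut spilled = Vec::new(); // stand-in for batches written to a spill file
    let mut spill_mode = false;

    for batch in merged {
        if !spill_mode && !pool.try_reserve(batch.len()) {
            // First failed reservation: switch to spill mode for this and
            // all future batches instead of erroring out.
            spill_mode = true;
        }
        if spill_mode {
            spilled.push(batch);
        } else {
            in_memory.push(batch);
        }
    }
    (in_memory, spilled)
}

fn main() {
    let mut pool = MemoryPool { limit: 10, used: 0 };
    let batches = vec![vec![0u8; 4], vec![0u8; 4], vec![0u8; 4]];
    // 4 + 4 bytes fit within the 10-byte limit; the third batch would exceed it.
    let (kept, spilled) = merge_with_spill_fallback(batches.into_iter(), &mut pool);
    println!("in-memory: {}, spilled: {}", kept.len(), spilled.len());
}
```

The key property is that a failed reservation downgrades to spilling rather than aborting the query, and no batch is ever held without a matching reservation.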
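The buffer-growth effect behind change 2 can be demonstrated with a plain `Vec<u8>`, which uses the same doubling growth strategy as arrow's `MutableBuffer`. This is a sketch of the rationale, not arrow code; exact capacities depend on the standard library's growth policy.

```rust
/// Grow a byte buffer one element past `boundary`, then shrink it; returns
/// (capacity after growth, capacity after shrink_to_fit).
fn grow_then_shrink(boundary: usize) -> (usize, usize) {
    let mut buf: Vec<u8> = Vec::with_capacity(boundary);
    // One element past the boundary forces a reallocation; the doubling
    // growth strategy leaves capacity near 2x the bytes actually stored --
    // the worst case the PR describes for sorted string arrays.
    buf.extend(std::iter::repeat(0u8).take(boundary + 1));
    let grown = buf.capacity();
    buf.shrink_to_fit(); // release the slack
    (grown, buf.capacity())
}

fn main() {
    let (grown, shrunk) = grow_then_shrink(1024);
    // `grown` is typically 2048 here; `shrunk` tracks the 1025 bytes held.
    println!("before shrink: {grown}, after: {shrunk}");
}
```

For primitive-type columns this shrink is a no-op, which is why the change costs nothing on those paths.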
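Change 3's heuristic is simple arithmetic: reserve twice each ingested batch's size so the merge phase's row representation has headroom. A sketch; the factor comes from this PR's description and the names here are hypothetical, not a stable API.

```rust
// The PR reserves 2x each ingested batch's size up front. If the merge-time
// overhead (e.g. the row representation of the sort columns) stays within
// one extra batch size, the reservation covers it; otherwise the user must
// still raise `sort_spill_reservation_bytes`.
const MERGE_HEADROOM_FACTOR: usize = 2;

fn reservation_for(batch_bytes: usize) -> usize {
    batch_bytes * MERGE_HEADROOM_FACTOR
}

fn main() {
    let batch_bytes = 8 * 1024 * 1024; // an 8 MiB batch
    println!("reserve {} bytes", reservation_for(batch_bytes));
}
```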