alamb commented on code in PR #21426:
URL: https://github.com/apache/datafusion/pull/21426#discussion_r3045025814
##########
datafusion/common/src/config.rs:
##########
@@ -557,6 +557,19 @@ config_namespace! {
/// batches and merged.
pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
+ /// Maximum buffer capacity (in bytes) per partition for BufferExec
+ /// inserted during sort pushdown optimization.
+ ///
+ /// When PushdownSort eliminates a SortExec under SortPreservingMergeExec,
+ /// a BufferExec is inserted to replace SortExec's buffering role. This
+ /// prevents I/O stalls by allowing the scan to run ahead of the merge.
+ ///
+ /// This uses strictly less memory than the SortExec it replaces (which
+ /// buffers the entire partition). The buffer respects the global memory
+ /// pool limit. Setting this to a large value is safe — actual memory
+ /// usage is bounded by partition size and global memory limits.
+ pub sort_pushdown_buffer_capacity: usize, default = 1024 * 1024 * 1024
Review Comment:
This PR increases the buffer size because
> 64MB was too small for wide-row scans (16-column TPC-H SELECT * queries showed I/O stalls)

To be clear to anyone reading this, what will be on main is still better
than 53.0.0, because prior to https://github.com/apache/datafusion/pull/21182
DataFusion would have sorted the entire input (rather than just buffering it).

A fixed size like this is likely to buffer more than required for
narrow-row cases.

I suspect a better solution than a fixed-size buffer would be some
calculation based on the actual size of the data (e.g. the number of rows to
buffer). However, a row count alone makes it tricky to compute or constrain
memory use when large strings are involved.

We would probably need both a row limit and a memory cap, and pick
the smaller of the two.

We can perhaps do this as a follow-on issue/PR.
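
To make the "row limit plus memory cap" idea concrete, here is a minimal sketch. It is not DataFusion code: `BufferLimits`, its fields, and the estimated row width are all hypothetical names invented for illustration, and a real implementation would have to take sizes from the actual record batches and account against the memory pool.

```rust
/// Hypothetical pair of limits for a scan-ahead buffer.
/// These names are illustrative only and are not DataFusion APIs.
#[derive(Debug, Clone, Copy)]
struct BufferLimits {
    /// Maximum number of rows to hold in the buffer.
    max_rows: usize,
    /// Maximum number of bytes to hold in the buffer.
    max_bytes: usize,
}

impl BufferLimits {
    /// Effective byte budget: the row limit converted to bytes using an
    /// estimated row width, capped by the memory limit (the smaller wins).
    fn effective_bytes(&self, est_row_bytes: usize) -> usize {
        self.max_rows
            .saturating_mul(est_row_bytes)
            .min(self.max_bytes)
    }

    /// Returns true once either limit has been reached, so the producer
    /// should stop running ahead of the consumer.
    fn is_full(&self, buffered_rows: usize, buffered_bytes: usize) -> bool {
        buffered_rows >= self.max_rows || buffered_bytes >= self.max_bytes
    }
}

fn main() {
    let limits = BufferLimits {
        max_rows: 100_000,
        max_bytes: 64 * 1024 * 1024,
    };

    // Narrow rows (assumed 16 bytes each): the row limit is the binding one.
    println!("narrow budget: {} bytes", limits.effective_bytes(16));

    // Wide rows (assumed 4 KiB each): the memory cap is the binding one.
    println!("wide budget: {} bytes", limits.effective_bytes(4096));
}
```

With these assumed numbers, narrow rows stop at the row limit long before the memory cap, while wide rows (or large strings) are stopped by the memory cap, which is the behavior a single fixed byte size cannot give for both cases.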
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]