Re: [PR] Docs: Add Tuning Guide for larger-than-memory queries [datafusion]

via GitHub Thu, 07 Aug 2025 14:36:45 -0700


alamb commented on code in PR #17069:
URL: https://github.com/apache/datafusion/pull/17069#discussion_r2261433216



##########
dev/update_config_docs.sh:
##########
@@ -149,6 +149,37 @@ SET datafusion.execution.target_partitions = '1';
 
 [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
 
+## Memory-limited Queries
+
+When executing a memory-consuming query under a tight memory limit, DataFusion will
+attempt to use a separate execution path for the query, and the intermediate results
+will be spilled to disk.

Review Comment:
   ```suggestion
   When executing a memory-consuming query under a tight memory limit, DataFusion
   will spill intermediate results to disk.
   ```



##########
dev/update_config_docs.sh:
##########
@@ -149,6 +149,37 @@ SET datafusion.execution.target_partitions = '1';
 
 [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
 
+## Memory-limited Queries
+
+When executing a memory-consuming query under a tight memory limit, DataFusion will
+attempt to use a separate execution path for the query, and the intermediate results
+will be spilled to disk.
+
+If the [`FairSpillPool`] is used, partitions will attempt to divide the available
+memory evenly. If the partition count `datafusion.execution.target_partitions`
+is set too high, each partition will be allocated less memory, and the out-of-core
+execution path will trigger more spills and possibly slow down the query.
+
+Additionally, all of the external join, aggregate, and sort operations now rely on the external
+sort implementation, which sorts the buffered data first, then spills, and in the
+final stage reads spills back incrementally and performs a sort-preserving merge. If the
+`datafusion.execution.batch_size` is set too large, in the sort-preserving merge
+phase the executor can only merge a small number of spilled sorted runs, which
+causes more re-spills to happen. As a result, setting `batch_size` to a smaller
+value can help reduce the number of spills.

Review Comment:
   ```suggestion
   Additionally, while spilling, data is read back in `datafusion.execution.batch_size` size batches.
   The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting
   can help reduce the number of subsequent spills required.
   ```
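
To make the `batch_size` trade-off concrete, here is a back-of-the-envelope sketch in Python. It is not DataFusion's actual memory accounting: it assumes the merge keeps exactly one batch per spilled run in memory and that rows have a fixed average size. The function names and the 100-byte row size are hypothetical, chosen only for illustration.

```python
import math


def merge_fan_in(memory_budget_bytes, batch_size, avg_row_bytes):
    """Rough number of sorted runs that fit in one merge.

    Assumption (not taken from DataFusion): the merge buffers exactly one
    batch of `batch_size` rows per run, each row `avg_row_bytes` wide.
    """
    bytes_per_batch = batch_size * avg_row_bytes
    return memory_budget_bytes // bytes_per_batch


def merge_passes(num_runs, fan_in):
    """Merge passes needed to reduce `num_runs` sorted runs to a single run.

    Every pass except the last writes its output back to disk (a re-spill).
    """
    if num_runs <= 1:
        return 0
    if fan_in < 2:
        raise ValueError("need room for at least two runs to merge")
    passes = 0
    while num_runs > 1:
        num_runs = math.ceil(num_runs / fan_in)
        passes += 1
    return passes


budget = 16 * 1024 * 1024  # hypothetical 16 MiB budget for one partition
for batch_size in (8192, 1024):
    fan_in = merge_fan_in(budget, batch_size, avg_row_bytes=100)
    print(batch_size, fan_in, merge_passes(100, fan_in))
```

Under these assumptions, 100 spilled runs need two merge passes at a batch size of 8192 but only one at 1024, which is the effect the suggestion describes.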



##########
dev/update_config_docs.sh:
##########
@@ -149,6 +149,37 @@ SET datafusion.execution.target_partitions = '1';
 
 [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
 
+## Memory-limited Queries
+
+When executing a memory-consuming query under a tight memory limit, DataFusion will
+attempt to use a separate execution path for the query, and the intermediate results
+will be spilled to disk.
+
+If the [`FairSpillPool`] is used, partitions will attempt to divide the available
+memory evenly. If the partition count `datafusion.execution.target_partitions`
+is set too high, each partition will be allocated less memory, and the out-of-core
+execution path will trigger more spills and possibly slow down the query.

Review Comment:
   ```suggestion
   When the [`FairSpillPool`] is used, memory is divided evenly among partitions.
   The higher the value of `datafusion.execution.target_partitions`,
   the less memory is allocated to each partition, and the out-of-core
   execution path may trigger more frequently, slowing down execution.
   ```
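
As a rough illustration of the even split described above, the sketch below divides a pool across partitions. It is a simplified model, not the `FairSpillPool` implementation: it assumes one spilling operator per partition and ignores any memory held by non-spilling consumers. The function name and the 1 GiB pool size are hypothetical.

```python
def per_partition_budget(pool_bytes, target_partitions):
    """Approximate memory share of each partition under an even split.

    Assumption (illustrative only): one spilling operator per partition
    and no memory reserved by other consumers.
    """
    if target_partitions < 1:
        raise ValueError("target_partitions must be at least 1")
    return pool_bytes // target_partitions


pool = 1024 * 1024 * 1024  # hypothetical 1 GiB memory limit
for partitions in (4, 16, 64):
    share_mib = per_partition_budget(pool, partitions) // (1024 * 1024)
    print(f"{partitions} partitions: about {share_mib} MiB each")
```

With this model, going from 4 to 64 partitions shrinks each share from 256 MiB to 16 MiB, so each partition reaches its limit and starts spilling much sooner.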



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

