alamb commented on code in PR #17069: URL: https://github.com/apache/datafusion/pull/17069#discussion_r2261433216
########## dev/update_config_docs.sh: ########## @@ -149,6 +149,37 @@ SET datafusion.execution.target_partitions = '1'; [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html +## Memory-limited Queries + +When executing a memory-consuming query under a tight memory limit, DataFusion will +attempt to use a separate execution path for the query, and the intermediate results +will be spilled to disk. Review Comment: ```suggestion When executing a memory-consuming query under a tight memory limit, DataFusion will spill intermediate results to disk. ``` ########## dev/update_config_docs.sh: ########## @@ -149,6 +149,37 @@ SET datafusion.execution.target_partitions = '1'; [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html +## Memory-limited Queries + +When executing a memory-consuming query under a tight memory limit, DataFusion will +attempt to use a separate execution path for the query, and the intermediate results +will be spilled to disk. + +If the [`FairSpillPool`] is used, partitions will attempt to divide the available +memory evenly. If the partition count `datafusion.execution.target_partitions` +is set too high, each partition will be allocated less memory, and the out-of-core +execution path will trigger more spills and possibly slow down the query. + +Additionally, all of the external join, aggregate, and sort operations now rely on the external +sort implementation, which sorts the buffered data first, then spills, and in the +final stage reads spills back incrementally and performs a sort-preserving merge. If the +`datafusion.execution.batch_size` is set too large, in the sort-preserving merge +phase the executor can only merge a small number of spilled sorted runs, which +causes more re-spills to happen. As a result, setting `batch_size` to a smaller +value can help reduce the number of spills. Review Comment: ```suggestion Additionally, while spilling, data is read back in `datafusion.execution.batch_size` size batches. The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting can help reduce the number of subsequent spills required. ``` ########## dev/update_config_docs.sh: ########## @@ -149,6 +149,37 @@ SET datafusion.execution.target_partitions = '1'; [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html +## Memory-limited Queries + +When executing a memory-consuming query under a tight memory limit, DataFusion will +attempt to use a separate execution path for the query, and the intermediate results +will be spilled to disk. + +If the [`FairSpillPool`] is used, partitions will attempt to divide the available +memory evenly. If the partition count `datafusion.execution.target_partitions` +is set too high, each partition will be allocated less memory, and the out-of-core +execution path will trigger more spills and possibly slow down the query. Review Comment: ```suggestion When the [`FairSpillPool`] is used, memory is divided evenly among partitions. The higher the value of `datafusion.execution.target_partitions` the less memory is allocated to each partition, and the out-of-core execution path may trigger more frequently, slowing down execution. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org