alamb opened a new issue, #19216: URL: https://github.com/apache/datafusion/issues/19216
### Is your feature request related to a problem or challenge?

While working with @zhuqi-lucas on https://github.com/apache/datafusion/pull/19042 we noticed that it is not possible to sort the hits.parquet dataset with the default settings and a 4G memory limit.

Get the data:

```shell
./benchmarks/bench.sh data clickbench_1
```

Try to re-sort it using 4G of memory (on a 20 core Mac M3 laptop):

```shell
datafusion-cli -m 4G -c "COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"
```

Results in:

```
DataFusion CLI v51.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit, or decreasing sort_spill_reservation_bytes
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[7] with top memory consumers (across reservations) as:
  ExternalSorterMerge[4]#11(can spill: false) consumed 883.8 MB, peak 883.8 MB,
  ExternalSorterMerge[1]#5(can spill: false) consumed 812.6 MB, peak 812.6 MB,
  ExternalSorterMerge[9]#21(can spill: false) consumed 764.8 MB, peak 764.8 MB.
Error: Failed to allocate additional 13.7 MB for ExternalSorter[7] with 0.0 B already allocated for this reservation - 1088.1 KB remain available for the total pool
```

As @2010YOUY01 has documented in https://datafusion.apache.org/user-guide/configs.html#memory-limited-queries, this query does run to completion with fewer target partitions, for example 1:

```sql
SET datafusion.execution.target_partitions = 1;
```

```shell
datafusion-cli -m 4G -c "SET datafusion.execution.target_partitions = 1; COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"
```

### Describe the solution you'd like

I would like DataFusion to be able to complete such queries with a reasonable amount of RAM without having to tune the target partitions.

### Describe alternatives you've considered

Maybe there could be some "rule of thumb" for the required resources. For example, perhaps we could make sure queries run with 1 GB of RAM per core (and adjust the batch size / target partitioning automatically if needed).

### Additional context

_No response_

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
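
To make the "1 GB of RAM per core" rule of thumb from the alternatives section concrete, here is a minimal sketch of the arithmetic. The helper `suggest_target_partitions` and its parameters are hypothetical and not part of DataFusion; treating `4G` as 4 GiB is also an assumption. The issue only reports that `target_partitions = 1` completes, so the value printed here is an illustration of the heuristic, not a setting known to work.

```python
GIB = 1 << 30  # assumption: "1 GB per core" is read as 1 GiB


def suggest_target_partitions(memory_limit_bytes, cpu_cores, bytes_per_partition=GIB):
    """Hypothetical heuristic: cap partitions so each one gets a fixed memory budget."""
    if memory_limit_bytes <= 0 or cpu_cores <= 0 or bytes_per_partition <= 0:
        raise ValueError("all arguments must be positive")
    # How many partitions the memory limit can fund at the per-partition budget.
    affordable = memory_limit_bytes // bytes_per_partition
    # Never exceed the core count, and never go below one partition.
    return max(1, min(cpu_cores, affordable))


if __name__ == "__main__":
    # The setup from the report: a 4G memory limit on a 20 core machine.
    n = suggest_target_partitions(4 * GIB, 20)
    print(f"SET datafusion.execution.target_partitions = {n};")
```

The emitted statement uses the same `datafusion.execution.target_partitions` setting shown in the workaround, so it could be prepended to the `COPY` command in the same way.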
