[I] Sort ClickBench data using 4GB on standard laptop (spilling) [datafusion]

via GitHub Mon, 08 Dec 2025 10:04:36 -0800


alamb opened a new issue, #19216:
URL: https://github.com/apache/datafusion/issues/19216


   ### Is your feature request related to a problem or challenge?
   
   While working with @zhuqi-lucas on 
https://github.com/apache/datafusion/pull/19042 we noticed it is not possible 
to sort the hits.parquet dataset 
   
   
   Get the data
   ```shell
   ./benchmarks/bench.sh data clickbench_1
   ```
   
   Try to resort it using 4G of memory (on a 20 core Mac M3 laptop):
   
   ```shell
   datafusion-cli -m 4G -c "COPY (SELECT * FROM 'benchmarks/data/hits.parquet' 
ORDER BY \"EventTime\") TO 'hits_sorted.parquet' STORED AS PARQUET;"
   ```
   
   Results in 
   ```
   DataFusion CLI v51.0.0
   Error: Not enough memory to continue external sort. Consider increasing the 
memory limit, or decreasing sort_spill_reservation_bytes
   caused by
   Resources exhausted: Additional allocation failed for ExternalSorter[7] with 
top memory consumers (across reservations) as:
     ExternalSorterMerge[4]#11(can spill: false) consumed 883.8 MB, peak 883.8 
MB,
     ExternalSorterMerge[1]#5(can spill: false) consumed 812.6 MB, peak 812.6 
MB,
     ExternalSorterMerge[9]#21(can spill: false) consumed 764.8 MB, peak 764.8 
MB.
   Error: Failed to allocate additional 13.7 MB for ExternalSorter[7] with 0.0 
B already allocated for this reservation - 1088.1 KB remain available for the 
total pool
   ```
   
   As @2010YOUY01  has documented in 
https://datafusion.apache.org/user-guide/configs.html#memory-limited-queries,  
this query does run to completion with fewer target partitions for example 1:
   
   ```sql
   SET datafusion.execution.target_partitions = 1;
   ```
   
   ```shell
   datafusion-cli -m 4G -c "SET datafusion.execution.target_partitions = 1; 
COPY (SELECT * FROM 'benchmarks/data/hits.parquet' ORDER BY \"EventTime\") TO 
'hits_sorted.parquet' STORED AS PARQUET;"
   ```
   
   ### Describe the solution you'd like
   
   I would like DataFusion to be able to complete such queries with a 
reasonable amount of RAM without having to tune the target partitions
   
   
   ### Describe alternatives you've considered
   
   Maybe there could be be some "rule of thumb" for the required resources -- 
for example, perhaps we could make sure queries run with 1 GB of RAM per core 
(and adjust the batch size / target partitioning automatically if needed)
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Sort ClickBench data using 4GB on standard laptop (spilling) [datafusion]

Reply via email to