zhuqi-lucas commented on PR #14644:
URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2664713258

   Hi @westonpace , I think the problem is that for your case you need to set the 
partition count and also increase the memory limit:
   
   
   ```
   1. Set the partition count to 1:
      .with_target_partitions(1)

   2. Increase the memory limit, for example to 300MB
   ```
   
   
   Details here:
   
   https://github.com/apache/datafusion/pull/14644#issuecomment-2660748423
   
   
   1. The DataSourceExec may have many partitions, and each SortExec on a 
partition only gets a fair share of the 100MB pool, so no single partition gets 
enough memory to operate. This is still the case even with worker_threads = 1. 
If you still want to sort the 4.2GB Parquet file using 100MB of memory, you can 
set .with_target_partitions(1) in your session config.
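   Concretely, both settings can be applied when building the SessionContext. 
This is a minimal sketch, assuming the `SessionConfig::with_target_partitions` 
and `RuntimeConfig::with_memory_limit` APIs of recent DataFusion releases 
(exact names may differ in your version):

   ```rust
   use std::sync::Arc;
   use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
   use datafusion::prelude::{SessionConfig, SessionContext};

   fn build_ctx() -> datafusion::error::Result<SessionContext> {
       // One target partition, so the single SortExec gets the whole pool.
       let config = SessionConfig::new().with_target_partitions(1);
       // 300MB limit; the second argument is the fraction of that limit
       // the memory pool may actually use (1.0 = all of it).
       let runtime = RuntimeConfig::new().with_memory_limit(300 * 1024 * 1024, 1.0);
       Ok(SessionContext::new_with_config_rt(
           config,
           Arc::new(RuntimeEnv::new(runtime)?),
       ))
   }
   ```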
   
   
   2. 100MB is not enough for the final merge with spill-reads. Roughly 200 
spill files are generated after ingesting all the batches, and a typical batch 
for this workload is 352,380 bytes. The memory needed for merging is 
200 * (352,380 bytes) * 2 > 100MB. The merge phase is unspillable, so it 
requires a minimum amount of memory to operate. Raising the memory limit to 
200MB will work for this particular workload.
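   The back-of-the-envelope arithmetic above can be checked directly. The file 
count, batch size, and 2x buffering factor are the figures quoted in this 
comment, not freshly measured values:

   ```rust
   fn main() {
       // Figures from this workload: ~200 spill files, ~352,380-byte batches,
       // and a 2x factor for the merge's read buffers.
       let spill_files: u64 = 200;
       let batch_bytes: u64 = 352_380;
       let merge_bytes = spill_files * batch_bytes * 2; // 140,952,000 bytes

       let mb = 1024 * 1024;
       assert!(merge_bytes > 100 * mb); // a 100MB pool is too small
       assert!(merge_bytes < 200 * mb); // 200MB covers this workload
       println!("merge needs ~{} MB", merge_bytes / mb);
   }
   ```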
   
   
   As future work, a complete spilling solution is tracked here:
   
   https://github.com/apache/datafusion/issues/14692
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

