zhuqi-lucas commented on PR #14644: URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2664713258
Hi @westonpace, I think the problem is that for your case we need to both set the partition count and increase the memory limit:

```
1. Set the partition count to 1: .with_target_partitions(1)
2. Increase the memory limit, for example to 300MB
```

Details here: https://github.com/apache/datafusion/pull/14644#issuecomment-2660748423

1. The `DataSourceExec` may have many partitions, and each `SortExec` on those partitions only gets a fair share of the 100MB pool, so no single partition gets enough memory to operate. This is still the case even with `worker_threads = 1`. If you still want to sort a 4.2GB Parquet file using 100MB of memory, you can set `.with_target_partitions(1)` in your session config.

2. 100MB is not enough for the final merging with spill-reads. Roughly 200 spill files are generated after ingesting all the batches, and a typical batch for this workload is 352380 bytes, so the merge needs about 200 * 352380 * 2 ≈ 141MB > 100MB. The merging phase is unspillable, so it requires a minimum amount of memory to operate. Raising the memory limit to 200MB will work for this particular workload.

For future work, a complete spill solution is tracked here: https://github.com/apache/datafusion/issues/14692
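For reference, here is a minimal sketch of how the two settings might be wired up together, assuming a recent DataFusion release where `RuntimeEnvBuilder` is available (older versions use `RuntimeConfig`/`RuntimeEnv` instead); the file path and the sort column `a` are placeholders, not from the original report:

```rust
use datafusion::error::Result;
use datafusion::execution::runtime_env::RuntimeEnvBuilder;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // 1. Run the query on a single partition so the sort is not
    //    competing with sibling partitions for the shared memory pool.
    let config = SessionConfig::new().with_target_partitions(1);

    // 2. Raise the memory limit to 300MB (the second argument is the
    //    fraction of that limit the pool is allowed to use).
    let runtime = RuntimeEnvBuilder::new()
        .with_memory_limit(300 * 1024 * 1024, 1.0)
        .build_arc()?;

    let ctx = SessionContext::new_with_config_rt(config, runtime);

    // "data.parquet" stands in for the 4.2GB file from the report.
    let df = ctx
        .read_parquet("data.parquet", ParquetReadOptions::default())
        .await?
        .sort(vec![col("a").sort(true, false)])?;

    df.show().await?;
    Ok(())
}
```

Note the trade-off: with `target_partitions(1)` the sort runs on a single partition, so this exchanges throughput for a lower memory floor.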