zhuqi-lucas commented on PR #14644:
URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2660771854

   > > Thank you @kazuyukitanimura for the PR. I applied it to try to fix the 
test, but the test above still fails for me. I am not sure whether I am 
missing something.
   > 
   > There are 2 problems:
   > 
   > 1. The `DataSourceExec` may have many partitions, and the `SortExec` on 
each partition only gets a fair share of the 100MB pool, so no partition 
gets enough memory to operate. This is still the case even with 
`worker_threads = 1`. If you still want to sort a 4.2GB parquet file using 100MB 
of memory, you can set `.with_target_partitions(1)` in your session config.
   > 2. 100MB is not enough for the final merging with spill-reads. Roughly 
200 spill files are generated after ingesting all the batches, and the size 
of a typical batch for this workload is 352380 bytes. The memory needed for 
merging will be 200 * (352380 bytes) * 2 > 100MB. The merging phase is unspillable, 
so it requires a minimum amount of memory to operate. Raising the memory limit 
to 200MB will work for this particular workload.
   > 
   > One possible fix for problem 2 is to use a smaller batch size when writing 
batches to spill files, so that the unspillable memory required for the final 
spill-read merging will be smaller. Alternatively, we could simply leave this 
problem as is and require the user to raise the memory limit.
   
   
   **Update: it works after changing to 1 partition and increasing the memory 
limit.**
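
The arithmetic in the quoted explanation can be checked with a short sketch. This is only an illustration of the estimate given above, not DataFusion code: the function names are hypothetical, "MB" is treated as MiB (the conclusion is the same with decimal megabytes), the even split across partitions is a simplification of how a fair pool behaves, and the partition count of 8 is an assumed value that does not come from the thread.

```rust
/// One mebibyte. Assumption: the "MB" figures in the thread are treated as MiB.
const MIB: u64 = 1024 * 1024;

/// Typical batch size in bytes for this workload, as quoted in the thread.
const TYPICAL_BATCH_BYTES: u64 = 352_380;

/// Rough estimate of the unspillable memory for the final spill-read merge:
/// one batch per spill file, multiplied by the factor used in the thread.
fn merge_memory_estimate(spill_files: u64, batch_bytes: u64, factor: u64) -> u64 {
    spill_files * batch_bytes * factor
}

/// Per-partition share of the pool, assuming an even split (a simplification).
fn fair_share(pool_bytes: u64, partitions: u64) -> u64 {
    pool_bytes / partitions
}

fn main() {
    // Problem 2: about 200 spill files, factor of 2, as in the thread.
    let needed = merge_memory_estimate(200, TYPICAL_BATCH_BYTES, 2);
    println!("merge needs roughly {} bytes", needed);
    println!("fits in 100 MiB: {}", needed <= 100 * MIB);
    println!("fits in 200 MiB: {}", needed <= 200 * MIB);

    // Problem 1: hypothetical partition count, not taken from the thread.
    let partitions = 8;
    println!(
        "share of a 100 MiB pool across {} partitions: {} bytes each",
        partitions,
        fair_share(100 * MIB, partitions)
    );
}
```

With a single partition the whole pool goes to one sort, and the merge estimate falls between the 100MB and 200MB limits, which is consistent with the workload passing only after both changes.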


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


