Re: [I] [EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts [datafusion]

via GitHub Fri, 04 Apr 2025 11:58:11 -0700


rluvaton commented on issue #15271:
URL: https://github.com/apache/datafusion/issues/15271#issuecomment-2773463974


   While debugging a performance issue in `AggregateExec`, I noticed that we 
keep all spilled files open (`RefCountedTempFile` keeps the `tempfile`, which 
holds a `File`).
   
   Also, when merging, we read at least one batch from every spill file:
   
https://github.com/apache/datafusion/blob/73171986166e3f83ba2b5f8e5ac2f85463dadb28/datafusion/physical-plan/src/aggregates/row_hash.rs#L1059-L1062
   
   So if I have a lot of spill files, or if every batch is really huge (e.g. it 
contains very large lists, like the result of `array_agg` on a large dataset), 
all of those batches are in memory at the same time.
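   
   To make the concern concrete, here is a rough back-of-the-envelope sketch. 
It is not DataFusion code: `min_merge_buffer_bytes` and all the numbers are 
made up for illustration, and it only assumes that the merge buffers one batch 
per spill file, as described above.

```rust
/// Lower bound on the bytes buffered when a merge starts, assuming one batch
/// is read from every spill file. `first_batch_bytes[i]` is the (hypothetical)
/// in-memory size of the first batch of spill file `i`.
fn min_merge_buffer_bytes(first_batch_bytes: &[u64]) -> u64 {
    first_batch_bytes.iter().sum()
}

fn main() {
    const MIB: u64 = 1024 * 1024;

    // Illustrative numbers only: 500 spill files with 1 MiB batches...
    let small = vec![MIB; 500];
    // ...versus 500 spill files whose batches hold large lists (64 MiB each).
    let huge = vec![64 * MIB; 500];

    println!("small batches: {} MiB", min_merge_buffer_bytes(&small) / MIB);
    println!("huge batches:  {} MiB", min_merge_buffer_bytes(&huge) / MIB);
}
```

   The buffered size grows with both the number of spill files and the batch 
size, and the number of open file handles grows with the number of spill files.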
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


