2010YOUY01 commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2808240189
> > Benchmark results: (I think there is no significant regression for an extra round of re-spill, if it's running on a machine with fast SSDs)
>
> It seems to me that there is a 30% regression in performance compared to main when there is enough memory, right?
>
> > #### Result
> >
> > Main (1.2G):
> > Q7 avg time: 8680.47 ms
> >
> > PR (1.2G):
> > Q7 avg time: 11808.71 ms
>
> But this PR is significantly better in that it can complete with only 500M of memory.
>
> Is there any way to regain the performance (maybe by choosing how many merge phases to do based on available memory rather than a fixed size)?

If we manually set this max merge degree to a larger value, the merging behavior will be equivalent to the current implementation:

```
Q7 iteration 0 took 7242.8 ms and returned 59986052 rows
Q7 iteration 1 took 7203.4 ms and returned 59986052 rows
Q7 iteration 2 took 9812.6 ms and returned 59986052 rows
Q7 avg time: 8086.24 ms
```

I think auto-tuning is possible, and it is also a good future optimization, but it requires some work to extend the memory pool to estimate the available memory for the current reservation.
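The auto-tuning idea could be sketched roughly as below. This is a hypothetical illustration, not DataFusion's actual API: the function name `auto_merge_degree`, the `per_stream_buffer` parameter, and the 8 MB buffer figure are all assumptions. The point is only that the merge degree would be derived from the memory the pool can still grant, rather than being a fixed constant.

```rust
/// Hypothetical sketch: cap the merge degree by how many per-stream read
/// buffers fit into the memory currently available to this reservation.
fn auto_merge_degree(
    available_memory: usize,  // bytes the memory pool estimates it can still grant
    per_stream_buffer: usize, // bytes needed to buffer one spill file during merge
    max_degree: usize,        // upper bound (e.g. limited by open file handles)
) -> usize {
    // Each spill file being merged needs one in-memory read buffer, so the
    // affordable degree is the number of buffers that fit in the budget.
    let affordable = available_memory / per_stream_buffer.max(1);
    // Merging fewer than 2 streams at a time makes no progress, so clamp.
    affordable.clamp(2, max_degree)
}

fn main() {
    // Plenty of memory (1.2 GB budget, assumed 8 MB buffers): the degree hits
    // the upper bound, so all spill files merge in a single wide pass.
    assert_eq!(auto_merge_degree(1_200_000_000, 8_000_000, 128), 128);
    // Tight memory (500 MB budget): the degree shrinks, trading an extra
    // merge pass for staying within the budget instead of failing.
    assert_eq!(auto_merge_degree(500_000_000, 8_000_000, 128), 62);
    println!("ok");
}
```

Under this scheme the 1.2G run would behave like current main (one wide merge, no regression), while the 500M run would automatically fall back to narrower, multi-pass merging.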