2010YOUY01 commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2808240189
> > Benchmark results: (I think there is no significant regression for an extra round of re-spill, if it's running on a machine with fast SSDs)
>
> It seems to me that there is a 30% regression in performance compared to main when there is enough memory, right?
>
> > #### Result
> >
> > Main (1.2G):
> > Q7 avg time: 8680.47 ms
> >
> > PR (1.2G):
> > Q7 avg time: 11808.71 ms
>
> But this PR is significantly better in that it can complete with only 500M of memory.
>
> Is there any way to regain the performance (maybe by choosing how many merge phases to do based on available memory rather than a fixed size)?

If we manually set this max merge degree to a larger value, the merging behavior will be equivalent to the current implementation:

```
Q7 iteration 0 took 7242.8 ms and returned 59986052 rows
Q7 iteration 1 took 7203.4 ms and returned 59986052 rows
Q7 iteration 2 took 9812.6 ms and returned 59986052 rows
Q7 avg time: 8086.24 ms
```

I think auto-tuning is possible, and it is also a good future optimization, but it requires some work to extend the memory pool to estimate the available memory for the current reservation.
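The auto-tuning idea could be sketched roughly as below. This is a hypothetical illustration, not DataFusion's actual API: the function name `auto_merge_degree`, the `per_stream_buffer` parameter, and the 8 MB buffer figure are all assumptions. The point is only that the merge degree would be derived from the memory the pool can still grant, rather than being a fixed constant.

```rust
/// Hypothetical sketch: cap the merge degree by how many per-stream read
/// buffers fit into the memory currently available to this reservation.
fn auto_merge_degree(
    available_memory: usize,  // bytes the memory pool estimates it can still grant
    per_stream_buffer: usize, // bytes needed to buffer one spill file during merge
    max_degree: usize,        // upper bound (e.g. limited by open file handles)
) -> usize {
    // Each spill file being merged needs one in-memory read buffer, so the
    // affordable degree is the number of buffers that fit in the budget.
    let affordable = available_memory / per_stream_buffer.max(1);
    // Merging fewer than 2 streams at a time makes no progress, so clamp.
    affordable.clamp(2, max_degree)
}

fn main() {
    // Plenty of memory (1.2 GB budget, assumed 8 MB buffers): the degree hits
    // the upper bound, so all spill files merge in a single wide pass.
    assert_eq!(auto_merge_degree(1_200_000_000, 8_000_000, 128), 128);
    // Tight memory (500 MB budget): the degree shrinks, trading an extra
    // merge pass for staying within the budget instead of failing.
    assert_eq!(auto_merge_degree(500_000_000, 8_000_000, 128), 62);
    println!("ok");
}
```

Under this scheme the 1.2G run would behave like current main (one wide merge, no regression), while the 500M run would automatically fall back to narrower, multi-pass merging.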