Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

via GitHub Sun, 22 Mar 2026 05:33:03 -0700


xiedeyantu commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4106184220


   > I think a rewrite like this might be useful, but I think it can also hurt
   > performance because of the join on grouping keys. So I think it needs to have a
   > config value (off by default) or when enabled some benchmarks showing that it
   > is better in large majority of the cases.
   >
   > I am also wondering if mostly for memory usage a `GroupsAccumulator` for
   > distinct count / sum might give similar/more improvements.
   
   @Dandandan Thank you for the explanation. It’s true that this rewrite adds a
hash join on the grouping keys, but if each distinct aggregation can run in
parallel, it may still come out ahead for queries with two or more
COUNT(DISTINCT) expressions. I agree that we should run benchmarks across
multiple scenarios to measure the actual effect.
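
For readers following the thread, here is a minimal sketch of the general shape of such a rewrite, not the PR's actual implementation. It uses Python's built-in `sqlite3` only to check that the two query forms agree. The table `t`, its columns `g`, `a`, `b`, and the sample rows are hypothetical, and `g` is declared NOT NULL because a plain equi-join would drop NULL grouping keys.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate the query shapes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (g TEXT NOT NULL, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("x", 1, 10), ("x", 1, 20), ("x", 2, 20), ("y", 3, None), ("y", 3, 30)],
)

# Original form: several COUNT(DISTINCT) in one aggregation.
original = """
SELECT g, COUNT(DISTINCT a) AS ca, COUNT(DISTINCT b) AS cb
FROM t GROUP BY g ORDER BY g
"""

# Assumed rewritten form: one independent aggregation per distinct column,
# joined back together on the grouping key. COUNT(a) over the deduplicated
# rows skips NULLs, matching COUNT(DISTINCT a).
rewritten = """
SELECT l.g, l.ca, r.cb
FROM (SELECT g, COUNT(a) AS ca
      FROM (SELECT DISTINCT g, a FROM t) GROUP BY g) AS l
JOIN (SELECT g, COUNT(b) AS cb
      FROM (SELECT DISTINCT g, b FROM t) GROUP BY g) AS r
  ON l.g = r.g
ORDER BY l.g
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(original_rows == rewritten_rows)
```

The two subqueries are independent of each other, which is where the potential for parallelism comes from; the join on `g` is the extra cost raised in the quoted comment. Which effect dominates is exactly what the benchmarks would need to show.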


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
