Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

via GitHub Mon, 23 Mar 2026 03:23:48 -0700


Dandandan commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4109532126


   
   Yes, I can see it would improve a lot of queries.
   
   > Thanks @Dandandan @xiedeyantu for the discussion.
   > 
   > **Join cost / default behavior:** Agreed that joining on grouping keys isn’t free. We’ve added **`datafusion.optimizer.enable_multi_distinct_count_rewrite`** with default **`false`**, so the rewrite is **opt-in** until we have benchmarks that justify turning it on by default. Sessions can enable it explicitly when they want to try the plan shape.
   > 
   > **Benchmarks:** We’re aligned that we should measure **latency and memory** vs baseline across scenarios (e.g. multiple `COUNT(DISTINCT …)` with `GROUP BY`, varying group-key cardinality vs distinct cardinality). We’ll follow up with numbers on the PR.
   > 
   > **GroupsAccumulator / execution-layer improvements:** That work is complementary: better distinct accumulators help **how** each aggregate runs; this rule changes **logical plan shape** when several large distincts share one aggregate. Both can coexist on the roadmap.
   > 
   > **Tests:** We added optimizer + SQL integration coverage (including cases with `lower`/`CAST` inside `COUNT(DISTINCT …)` and a test that the rule is a no-op when the config is off). Happy to iterate on naming and placement with maintainers.
   > 
   > > > I think a rewrite like this might be useful, but I think it can also hurt performance because of the join on grouping keys. So I think it needs to have a config value (off by default) or, when enabled, some benchmarks showing that it is better in the large majority of cases.
   > > > I am also wondering whether, mostly for memory usage, a `GroupsAccumulator` for distinct count / sum might give similar or greater improvements.
   > > 
   > > 
   > > @Dandandan Thank you for the explanation. It’s true that this would add a hash join, but if aggregation can be performed in parallel, there might be advantages in scenarios with two or more COUNT(DISTINCT) operations. I agree to run performance tests across multiple scenarios to evaluate the actual results.
   
   Sounds good - agreed.
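
For readers following the thread, here is a rough Python sketch of the plan shape under discussion. It is not DataFusion code and does not reflect the PR's actual implementation; the function names and the sample rows are hypothetical. It only illustrates the idea quoted above: several `COUNT(DISTINCT …)` aggregates over one `GROUP BY` can be computed as one aggregate per distinct column and then joined back on the grouping key, which is where the extra join cost comes from.

```python
from collections import defaultdict

def single_aggregate(rows):
    """Baseline shape: one aggregate keeps a distinct set per column for each group."""
    seen_a, seen_b = defaultdict(set), defaultdict(set)
    for g, a, b in rows:
        seen_a[g], seen_b[g]  # touch both so every group appears, even if all values are NULL
        if a is not None:     # COUNT(DISTINCT) ignores NULLs
            seen_a[g].add(a)
        if b is not None:
            seen_b[g].add(b)
    return {g: (len(seen_a[g]), len(seen_b[g])) for g in seen_a}

def count_distinct(rows, col):
    """One branch of the rewritten shape: GROUP BY g, COUNT(DISTINCT <col>)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[0]]  # make sure the group exists
        if row[col] is not None:
            seen[row[0]].add(row[col])
    return {g: len(values) for g, values in seen.items()}

def rewritten(rows):
    """Rewritten shape: independent per-column aggregates joined on the grouping key."""
    left = count_distinct(rows, 1)   # the two branches are independent,
    right = count_distinct(rows, 2)  # so an engine could run them in parallel
    # Stand-in for the hash join on the grouping key (both sides have the same keys here).
    return {g: (left[g], right[g]) for g in left}

# Hypothetical sample data: (group key, column a, column b)
rows = [("x", 1, "a"), ("x", 1, "b"), ("x", 2, None), ("y", 3, "c"), ("y", None, "c")]
baseline = single_aggregate(rows)
joined = rewritten(rows)
```

Both shapes return the same counts per group; what the benchmarks mentioned above would need to establish is whether the extra join is paid back by the independent branches, depending on group-key cardinality versus distinct cardinality.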


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

