ydgandhi commented on PR #21088: URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4108104692
Thanks @Dandandan @xiedeyantu for the discussion.

**Join cost / default behavior:** Agreed that joining on grouping keys isn’t free. We’ve added **`datafusion.optimizer.enable_multi_distinct_count_rewrite`** with default **`false`**, so the rewrite is **opt-in** until we have benchmarks that justify turning it on by default. Sessions can enable it explicitly when they want to try the plan shape.

**Benchmarks:** We’re aligned that we should measure **latency and memory** vs baseline across scenarios (e.g. multiple `COUNT(DISTINCT …)` with `GROUP BY`, varying group-key cardinality vs distinct cardinality). We’ll follow up with numbers on the PR.

**GroupsAccumulator / execution-layer improvements:** That work is complementary: better distinct accumulators help **how** each aggregate runs; this rule changes **logical plan shape** when several large distincts share one aggregate. Both can coexist on the roadmap.

**Tests:** We added optimizer + SQL integration coverage (including cases with `lower`/`CAST` inside `COUNT(DISTINCT …)` and a test that the rule is a no-op when the config is off).

Happy to iterate on naming and placement with maintainers.

> > I think a rewrite like this might be useful, but I think it can also hurt performance because of the join on grouping keys. So I think it needs to have a config value (off by default) or when enabled some benchmarks showing that it is better in large majority of the cases.
> >
> > I am also wondering if mostly for memory usage a `GroupsAccumulator` for distinct count / sum might give similar/more improvements.
> >
> > @Dandandan Thank you for the explanation. It’s true that this would add a hash join, but if aggregation can be performed in parallel, there might be advantages in scenarios with two or more COUNT(DISTINCT) operations. I agree to run performance tests across multiple scenarios to evaluate the actual results.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
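
For readers following the thread, here is a minimal sketch of the kind of plan-shape change under discussion: several `COUNT(DISTINCT …)` in one aggregate versus one aggregate per distinct column, joined back on the grouping key. The comment does not spell out the exact plan the rule produces, so the rewritten query below is an assumption based on the "join on grouping keys" description. The table `events` and columns `g`, `a`, `b` are hypothetical, and SQLite (via Python's standard library) stands in for DataFusion purely to show that the two shapes return the same rows.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (g TEXT, a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("x", "a1", "b1"),
        ("x", "a1", "b2"),
        ("x", "a2", "b2"),
        ("y", "a3", "b3"),
        ("y", "a3", "b3"),
    ],
)

# Baseline shape: several COUNT(DISTINCT ...) share one aggregate.
baseline = conn.execute(
    """
    SELECT g, COUNT(DISTINCT a), COUNT(DISTINCT b)
    FROM events
    GROUP BY g
    ORDER BY g
    """
).fetchall()

# Assumed rewritten shape: one aggregate per distinct column,
# joined back together on the grouping key.
rewritten = conn.execute(
    """
    SELECT da.g, da.cnt_a, db.cnt_b
    FROM (SELECT g, COUNT(DISTINCT a) AS cnt_a FROM events GROUP BY g) AS da
    JOIN (SELECT g, COUNT(DISTINCT b) AS cnt_b FROM events GROUP BY g) AS db
      ON da.g = db.g
    ORDER BY da.g
    """
).fetchall()

print(baseline)
print(rewritten)
```

The sample data deliberately has no NULL grouping keys: a plain `=` join would drop a NULL group, so a real rewrite needs null-safe key comparison. The extra join is also the cost raised in the quoted comments, which is why the option defaults to `false`. Assuming the option lands under the name given above, a session would presumably opt in with DataFusion's usual `SET datafusion.optimizer.enable_multi_distinct_count_rewrite = true;` statement.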
