Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

via GitHub Sat, 21 Mar 2026 03:13:57 -0700


ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4103001083


   Hi @Dandandan — thanks for this work; the cross-join split for **multiple 
distinct aggregates with no `GROUP BY`** is a strong fit for workloads like 
ClickBench extended.
   
   I’ve been working on a related but **different** pattern: **`GROUP BY` + 
several `COUNT(DISTINCT …)`** in the same aggregate (typical BI). In that 
situation, your rule is not quite applicable, because 
`MultiDistinctToCrossJoin` needs an **empty** `GROUP BY` and **all** aggregates 
to be distinct on different columns.
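
To make the applicability condition concrete, here is a minimal sketch of the check as described in this comment. It is not the code from the PR: `AggCall` and `cross_join_split_applies` are hypothetical names used only for illustration, and the "at least two aggregates" test is my reading of "multiple".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggCall:
    """Hypothetical stand-in for one aggregate expression in a plan node."""
    func: str
    arg: str
    distinct: bool = False


def cross_join_split_applies(group_by, aggs):
    """Model of the condition stated above: empty GROUP BY, and every
    aggregate is DISTINCT on a different column."""
    if group_by:                      # any grouping key disqualifies the node
        return False
    if len(aggs) < 2:                 # assumption: "multiple" means two or more
        return False
    if not all(a.distinct for a in aggs):
        return False
    args = [a.arg for a in aggs]
    return len(set(args)) == len(args)  # all distinct columns differ


# Global multi-distinct query: qualifies.
global_case = cross_join_split_applies(
    [], [AggCall("count", "a", True), AggCall("count", "b", True)]
)

# Shape of the BI query in this comment: GROUP BY plus a non-distinct COUNT(*).
bi_case = cross_join_split_applies(
    ["seller_name"],
    [
        AggCall("count", "*"),
        AggCall("count", "delivery_city", True),
        AggCall("count", "state", True),
    ],
)

print(global_case, bi_case)
```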
   
   Here is a concrete example from our internal benchmark suite on an 
`ecommerce_orders` table:
   
   ```sql
   SELECT
       seller_name,
       COUNT(*) as total_orders,
       COUNT(DISTINCT delivery_city) as cities_served,
       COUNT(DISTINCT state) as states_served,
       SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) as completed_orders,
       SUM(CASE WHEN order_status = 'Cancelled' THEN 1 ELSE 0 END) as cancelled_orders,
       ROUND(100.0 * SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) / COUNT(*), 2) as success_rate
   FROM orders_data
   GROUP BY seller_name
   HAVING COUNT(*) > 100
   ORDER BY total_orders DESC
   LIMIT 100;
   ```
   
   This is **not** “global” multi-distinct: it’s **per `seller_name`**, with 
**multiple `COUNT(DISTINCT …)`** plus other aggregates. That’s the class my 
optimizer rule (`MultiDistinctCountRewrite`) targets — rewriting the 
**`COUNT(DISTINCT …)`** pieces into **joinable sub-aggregates aligned on the 
same `GROUP BY` keys**, with correct `NULL` handling where needed. For this 
table with 20M rows, on my M4 machine the times are 0.42s vs ~17s for the 
implementation in #20940.
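
As a rough illustration of the idea (not the plan the rule actually produces), the sketch below runs a trimmed version of the query and a hand-written "joinable sub-aggregates" form against an in-memory SQLite table and checks that they agree. The sample rows, the CTE names, and the choice of SQLite are all assumptions made for the example. SQLite's `IS` is used as the null-safe join comparison so that a `NULL` `seller_name` group still matches; in DataFusion SQL the analogous operator would be `IS NOT DISTINCT FROM`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_data (seller_name TEXT, delivery_city TEXT, state TEXT)"
)
# Made-up rows, chosen to include NULL group keys and NULL distinct arguments.
conn.executemany(
    "INSERT INTO orders_data VALUES (?, ?, ?)",
    [
        ("acme", "Pune", "MH"),
        ("acme", "Pune", "MH"),
        ("acme", "Mumbai", "MH"),
        ("acme", None, "GJ"),
        ("bolt", "Delhi", "DL"),
        ("bolt", None, None),
        ("zed", None, "KA"),
        (None, "Agra", "UP"),
        (None, "Agra", None),
    ],
)

ORIGINAL = """
SELECT seller_name,
       COUNT(*) AS total_orders,
       COUNT(DISTINCT delivery_city) AS cities_served,
       COUNT(DISTINCT state) AS states_served
FROM orders_data
GROUP BY seller_name
ORDER BY seller_name
"""

# One sub-aggregate per COUNT(DISTINCT ...), each grouped on the same key,
# then joined back to the non-distinct aggregates.
REWRITTEN = """
WITH base AS (
    SELECT seller_name, COUNT(*) AS total_orders
    FROM orders_data
    GROUP BY seller_name
),
cities AS (
    SELECT seller_name, COUNT(*) AS cities_served
    FROM (SELECT DISTINCT seller_name, delivery_city
          FROM orders_data
          WHERE delivery_city IS NOT NULL)   -- COUNT(DISTINCT) ignores NULLs
    GROUP BY seller_name
),
states AS (
    SELECT seller_name, COUNT(*) AS states_served
    FROM (SELECT DISTINCT seller_name, state
          FROM orders_data
          WHERE state IS NOT NULL)
    GROUP BY seller_name
)
SELECT b.seller_name,
       b.total_orders,
       COALESCE(c.cities_served, 0) AS cities_served,  -- group with only NULLs
       COALESCE(s.states_served, 0) AS states_served
FROM base b
LEFT JOIN cities c ON c.seller_name IS b.seller_name    -- null-safe key match
LEFT JOIN states s ON s.seller_name IS b.seller_name
ORDER BY b.seller_name
"""

original = conn.execute(ORIGINAL).fetchall()
rewritten = conn.execute(REWRITTEN).fetchall()

for row in rewritten:
    print(row)
print("results match:", original == rewritten)
```

The two `NULL`-related details are the filter on the distinct argument, the `COALESCE` for groups that have no non-`NULL` value, and the null-safe join on the group key; dropping any of them changes the result for the `zed` or `NULL` seller rows in this sample.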
   
   In my opinion, they’re **complementary**: different predicates, different 
plans, and they can **coexist** in the optimizer pipeline (we’d want to 
sanity-check rule order so we don’t double-rewrite the same node).
   
   Happy to align naming, tests, and placement with you and the maintainers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

