ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4103001083
Hi @Dandandan — thanks for this work; the cross-join split for **multiple
distinct aggregates with no `GROUP BY`** is a strong fit for workloads like
ClickBench extended.
I’ve been working on a related but **different** pattern: **`GROUP BY` +
several `COUNT(DISTINCT …)`** in the same aggregate (typical BI). In that
situation, your rule is not quite applicable, because
`MultiDistinctToCrossJoin` needs an **empty** `GROUP BY` and **all** aggregates
to be distinct on different columns.
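For contrast, the global (no `GROUP BY`) case can be pictured as one single-row sub-aggregate per distinct column, combined with a cross join. The sketch below uses Python's `sqlite3` and a hypothetical table `t(a, b)` purely to show the query shape. It is an illustration under those assumptions, not the plan the rule in this PR actually emits.

```python
import sqlite3

# Hypothetical toy table; names and data are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "x"), (1, "y"), (2, "x"), (None, "z"), (3, None)],
)

# Original shape: several distinct aggregates, no GROUP BY.
original = "SELECT COUNT(DISTINCT a), COUNT(DISTINCT b) FROM t"

# Split shape: each distinct aggregate is computed on its own
# (yielding exactly one row), then the single rows are cross-joined.
split = """
SELECT da.n, db.n
FROM (SELECT COUNT(DISTINCT a) AS n FROM t) AS da
CROSS JOIN (SELECT COUNT(DISTINCT b) AS n FROM t) AS db
"""

original_rows = conn.execute(original).fetchall()
split_rows = conn.execute(split).fetchall()
```

Because every branch returns exactly one row, the cross join cannot multiply rows. That guarantee disappears once a `GROUP BY` is present, which is why the grouped case needs a join on the grouping keys instead.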
Here is a concrete example from our internal benchmark suite on an
`ecommerce_orders` table:
```sql
SELECT
    seller_name,
    COUNT(*) AS total_orders,
    COUNT(DISTINCT delivery_city) AS cities_served,
    COUNT(DISTINCT state) AS states_served,
    SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) AS completed_orders,
    SUM(CASE WHEN order_status = 'Cancelled' THEN 1 ELSE 0 END) AS cancelled_orders,
    ROUND(100.0 * SUM(CASE WHEN order_status = 'Completed' THEN 1 ELSE 0 END) / COUNT(*), 2) AS success_rate
FROM orders_data
GROUP BY seller_name
HAVING COUNT(*) > 100
ORDER BY total_orders DESC
LIMIT 100;
```
This is **not** “global” multi-distinct: it’s **per `seller_name`**, with
**multiple `COUNT(DISTINCT …)`** plus other aggregates. That’s the class my
optimizer rule (`MultiDistinctCountRewrite`) targets — rewriting the
**`COUNT(DISTINCT …)`** pieces into **joinable sub-aggregates aligned on the
same `GROUP BY` keys**, with correct `NULL` handling where needed. For this
table with 20M rows, on my M4 machine, the times are 0.42s vs ~17s for the
implementation in #20940.
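To make the "joinable sub-aggregates" idea concrete, here is a minimal sketch of the rewritten query shape, run through Python's `sqlite3` on a tiny hand-made `orders_data` table. The data, the subquery aliases (`base`, `c`, `s`), and the use of SQLite's null-safe `IS` for the join are assumptions made for illustration. The real `MultiDistinctCountRewrite` operates on DataFusion logical plans and may produce a different shape.

```python
import sqlite3

# Toy data (hypothetical): includes NULL values and a NULL group key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_data (seller_name TEXT, delivery_city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO orders_data VALUES (?, ?, ?)",
    [
        ("acme", "Pune", "MH"),
        ("acme", "Pune", "MH"),
        ("acme", "Mumbai", "MH"),
        ("acme", None, "KA"),
        ("bolt", None, None),
        ("bolt", None, None),
        (None, "Delhi", "DL"),
    ],
)

original = """
SELECT seller_name,
       COUNT(*) AS total_orders,
       COUNT(DISTINCT delivery_city) AS cities_served,
       COUNT(DISTINCT state) AS states_served
FROM orders_data
GROUP BY seller_name
ORDER BY seller_name
"""

# Sketch of the rewrite: one base aggregate for the non-distinct parts,
# plus one deduplicate-then-count sub-aggregate per distinct column,
# all joined on the GROUP BY key.
#  - `IS` is SQLite's null-safe equality, so a NULL seller_name still matches.
#  - COALESCE(..., 0) covers groups whose distinct column is entirely NULL.
rewritten = """
SELECT base.seller_name,
       base.total_orders,
       COALESCE(c.cities_served, 0) AS cities_served,
       COALESCE(s.states_served, 0) AS states_served
FROM (SELECT seller_name, COUNT(*) AS total_orders
      FROM orders_data
      GROUP BY seller_name) AS base
LEFT JOIN (SELECT seller_name, COUNT(*) AS cities_served
           FROM (SELECT DISTINCT seller_name, delivery_city
                 FROM orders_data
                 WHERE delivery_city IS NOT NULL)
           GROUP BY seller_name) AS c
       ON c.seller_name IS base.seller_name
LEFT JOIN (SELECT seller_name, COUNT(*) AS states_served
           FROM (SELECT DISTINCT seller_name, state
                 FROM orders_data
                 WHERE state IS NOT NULL)
           GROUP BY seller_name) AS s
       ON s.seller_name IS base.seller_name
ORDER BY base.seller_name
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
```

The two `NULL` cases are where a naive rewrite tends to go wrong: a plain `=` join would drop the `NULL` seller group, and an inner join would drop sellers such as `bolt` whose distinct columns are all `NULL`.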
In my opinion, they’re **complementary**: different predicates, different
plans, and they can **coexist** in the optimizer pipeline (we’d want to
sanity-check rule order so we don’t double-rewrite the same node).
Happy to align naming, tests, and placement with you and the maintainers.