ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4115456381
Hi @xiedeyantu update on the benchmarking. The new benchmarks use the
**grouped aggregate only** (no outer `SUM` or other wrapper) so timings reflect
the multi–`COUNT(DISTINCT …)` work directly. Note: Figures below are
**warm-cache** repeats on one machine (relative A/B, not a CI guarantee). Also,
these figures are from running each of the query 10 times and flipping the
rewrite setting on and off using the grouped queries above (no outer `SUM`).
---
## Queries tested — original runs
### Scenario 1: three distincts (grouped)
```sql
SELECT l_suppkey,
COUNT(DISTINCT l_orderkey) AS d1,
COUNT(DISTINCT l_partkey) AS d2,
COUNT(DISTINCT l_linenumber) AS d3
FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
GROUP BY l_suppkey;
```
### Scenario 2: CAST-based distincts (original)
```sql
SELECT l_suppkey,
COUNT(DISTINCT CAST(l_extendedprice AS INT)) AS cx,
COUNT(DISTINCT CAST(l_discount * 100 AS INT)) AS cy
FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
GROUP BY l_suppkey;
```
### Scenario 3: `lower()` + string distinct (original, low-cardinality group
key)
```sql
SELECT l_returnflag,
COUNT(DISTINCT lower(l_shipmode)) AS c1,
COUNT(DISTINCT l_shipinstruct) AS c2
FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
GROUP BY l_returnflag;
```
---
## Rerun — reviewer-aligned changes
**CAST (Scenario 2):** both distincts use `CAST` on a **column** only (no
`*` expression inside `CAST`):
```sql
SELECT l_suppkey,
COUNT(DISTINCT CAST(l_extendedprice AS INT)) AS cx,
COUNT(DISTINCT CAST(l_discount AS INT)) AS cy
FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
GROUP BY l_suppkey;
```
**High cardinality (Scenario 3):** same aggregates as before, but **`GROUP
BY l_suppkey`** instead of `l_returnflag`:
```sql
SELECT l_suppkey,
COUNT(DISTINCT lower(l_shipmode)) AS c1,
COUNT(DISTINCT l_shipinstruct) AS c2
FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
GROUP BY l_suppkey;
```
### Results (mean ± sample stddev, **n=10** per cell)
Measured with `/usr/bin/time -l` and `target/release/datafusion-cli` from
the repo root. RSS is **maximum resident set size** from `time -l` (MB here are
decimal megabytes: bytes / 10⁶). Queries are the **reviewer-aligned grouped
aggregates above** (no outer `SUM`).
**wall (s) mean ± stddev**
| Scenario | Rewrite OFF | Rewrite ON | Outcome |
|----------|------------:|-----------:|---------|
| 1) three distincts | 0.2490 ± 0.0099 | 0.1000 ± 0.0000 | Lower mean wall
time with ON |
| 2) cast-based (column-only `CAST`) | 0.1350 ± 0.0053 | 0.0700 ± 0.0000 |
Lower mean wall time with ON |
| 3) lower()+string (`GROUP BY l_suppkey`) | 3.3920 ± 0.6496 | 0.0800 ±
0.0000 | Much lower mean wall time with ON |
**max RSS (MB) mean ± stddev**
| Scenario | Rewrite OFF | Rewrite ON | Outcome |
|----------|------------:|-----------:|---------|
| 1) three distincts | 590.03 ± 10.12 | 722.62 ± 12.70 | Higher mean peak
RSS with ON |
| 2) cast-based (column-only `CAST`) | 331.83 ± 4.56 | 638.49 ± 8.29 |
Higher mean peak RSS with ON |
| 3) lower()+string (`GROUP BY l_suppkey`) | 9755.13 ± 114.94 | 277.66 ±
1.71 | Much lower mean peak RSS with ON |
Row outputs matched OFF vs ON for every scenario (same grouped aggregate
result). Wall time from `time` is low-precision on macOS (e.g. all runs may
round to the same hundredth of a second, so stddev can be 0).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]