Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

via GitHub Mon, 23 Mar 2026 22:15:05 -0700


ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4115456381


   Hi @xiedeyantu update on the benchmarking. The new benchmarks use the 
**grouped aggregate only** (no outer `SUM` or other wrapper) so timings reflect 
the multi–`COUNT(DISTINCT …)` work directly. Note: Figures below are 
**warm-cache** repeats on one machine (relative A/B, not a CI guarantee). Also, 
these figures are from running each of the query 10 times and flipping the 
rewrite setting on and off using the grouped queries above (no outer `SUM`).
   
   ---
   
   ## Queries tested — original runs
   
   ### Scenario 1: three distincts (grouped)
   
   ```sql
   SELECT l_suppkey,
          COUNT(DISTINCT l_orderkey) AS d1,
          COUNT(DISTINCT l_partkey) AS d2,
          COUNT(DISTINCT l_linenumber) AS d3
   FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
   GROUP BY l_suppkey;
   ```
   
   ### Scenario 2: CAST-based distincts (original)
   
   ```sql
   SELECT l_suppkey,
          COUNT(DISTINCT CAST(l_extendedprice AS INT)) AS cx,
          COUNT(DISTINCT CAST(l_discount * 100 AS INT)) AS cy
   FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
   GROUP BY l_suppkey;
   ```
   
   ### Scenario 3: `lower()` + string distinct (original, low-cardinality group 
key)
   
   ```sql
   SELECT l_returnflag,
          COUNT(DISTINCT lower(l_shipmode)) AS c1,
          COUNT(DISTINCT l_shipinstruct) AS c2
   FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
   GROUP BY l_returnflag;
   ```
   
   ---
   
   ## Rerun — reviewer-aligned changes
   
   **CAST (Scenario 2):** both distincts use `CAST` on a **column** only (no 
`*` expression inside `CAST`):
   
   ```sql
   SELECT l_suppkey,
          COUNT(DISTINCT CAST(l_extendedprice AS INT)) AS cx,
          COUNT(DISTINCT CAST(l_discount AS INT)) AS cy
   FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
   GROUP BY l_suppkey;
   ```
   
   **High cardinality (Scenario 3):** same aggregates as before, but **`GROUP 
BY l_suppkey`** instead of `l_returnflag`:
   
   ```sql
   SELECT l_suppkey,
          COUNT(DISTINCT lower(l_shipmode)) AS c1,
          COUNT(DISTINCT l_shipinstruct) AS c2
   FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
   GROUP BY l_suppkey;
   ```
   
   ### Results (mean ± sample stddev, **n=10** per cell)
   
   Measured with `/usr/bin/time -l` and `target/release/datafusion-cli` from 
the repo root. RSS is **maximum resident set size** from `time -l` (MB here are 
decimal megabytes: bytes / 10⁶). Queries are the **reviewer-aligned grouped 
aggregates above** (no outer `SUM`).
   
   **wall (s) mean ± stddev**
   
   | Scenario | Rewrite OFF | Rewrite ON | Outcome |
   |----------|------------:|-----------:|---------|
   | 1) three distincts | 0.2490 ± 0.0099 | 0.1000 ± 0.0000 | Lower mean wall 
time with ON |
   | 2) cast-based (column-only `CAST`) | 0.1350 ± 0.0053 | 0.0700 ± 0.0000 | 
Lower mean wall time with ON |
   | 3) lower()+string (`GROUP BY l_suppkey`) | 3.3920 ± 0.6496 | 0.0800 ± 
0.0000 | Much lower mean wall time with ON |
   
   **max RSS (MB) mean ± stddev**
   
   | Scenario | Rewrite OFF | Rewrite ON | Outcome |
   |----------|------------:|-----------:|---------|
   | 1) three distincts | 590.03 ± 10.12 | 722.62 ± 12.70 | Higher mean peak 
RSS with ON |
   | 2) cast-based (column-only `CAST`) | 331.83 ± 4.56 | 638.49 ± 8.29 | 
Higher mean peak RSS with ON |
   | 3) lower()+string (`GROUP BY l_suppkey`) | 9755.13 ± 114.94 | 277.66 ± 
1.71 | Much lower mean peak RSS with ON |
   
   Row outputs matched OFF vs ON for every scenario (same grouped aggregate 
result). Wall time from `time` is low-precision on macOS (e.g. all runs may 
round to the same hundredth of a second, so stddev can be 0).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

Reply via email to