Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

via GitHub Thu, 26 Mar 2026 04:29:28 -0700


xiedeyantu commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4133837810


   > Hi @xiedeyantu update on the benchmarking. The new benchmarks use the 
**grouped aggregate only** (no outer `SUM` or other wrapper) so timings reflect 
the multi–`COUNT(DISTINCT …)` work directly. Note: Figures below are 
**warm-cache** repeats on one machine (relative A/B, not a CI guarantee). Also, 
these figures are from running each of the query 10 times and flipping the 
rewrite setting on and off using the grouped queries above (no outer `SUM`).
   > 
   > ## Queries tested - original runs
   > ### Scenario 1: three distincts (grouped)
   > ```sql
   > SELECT l_suppkey,
   >        COUNT(DISTINCT l_orderkey) AS d1,
   >        COUNT(DISTINCT l_partkey) AS d2,
   >        COUNT(DISTINCT l_linenumber) AS d3
   > FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
   > GROUP BY l_suppkey;
   > ```
   > 
   > ## Rerun: reviewer-aligned changes
   > **CAST (Scenario 2):** both distincts use `CAST` on a **column** only (no 
`*` expression inside `CAST`):
   > 
   > ```sql
   > SELECT l_suppkey,
   >        COUNT(DISTINCT CAST(l_extendedprice AS INT)) AS cx,
   >        COUNT(DISTINCT CAST(l_discount AS INT)) AS cy
   > FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
   > GROUP BY l_suppkey;
   > ```
   > 
   > **High cardinality (Scenario 3):** same aggregates as before, but **`GROUP 
BY l_suppkey`** instead of `l_returnflag`:
   > 
   > ```sql
   > SELECT l_suppkey,
   >        COUNT(DISTINCT lower(l_shipmode)) AS c1,
   >        COUNT(DISTINCT l_shipinstruct) AS c2
   > FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
   > GROUP BY l_suppkey;
   > ```
   > 
   > ### Results (mean ± sample stddev, **n=10** per cell)
   > Measured with `/usr/bin/time -l` and `target/release/datafusion-cli` from 
the repo root. RSS is **maximum resident set size** from `time -l` (MB here are 
decimal megabytes: bytes / 10⁶). Queries are the **reviewer-aligned grouped 
aggregates above** (no outer `SUM`).
   > 
   > **wall (s) mean ± stddev**
   > 
   > Scenario   Rewrite OFF     Rewrite ON      Outcome
   > 1) three distincts 0.2490 ± 0.0099 0.1000 ± 0.0000 Lower mean wall time 
with ON
   > 2) cast-based (column-only `CAST`) 0.1350 ± 0.0053 0.0700 ± 0.0000 Lower 
mean wall time with ON
   > 3) lower()+string (`GROUP BY l_suppkey`)   3.3920 ± 0.6496 0.0800 ± 0.0000 
Much lower mean wall time with ON
   > **max RSS (MB) mean ± stddev**
   > 
   > Scenario   Rewrite OFF     Rewrite ON      Outcome
   > 1) three distincts 590.03 ± 10.12  722.62 ± 12.70  Higher mean peak RSS 
with ON
   > 2) cast-based (column-only `CAST`) 331.83 ± 4.56   638.49 ± 8.29   Higher 
mean peak RSS with ON
   > 3) lower()+string (`GROUP BY l_suppkey`)   9755.13 ± 114.94        277.66 
± 1.71   Much lower mean peak RSS with ON
   > Row outputs matched OFF vs ON for every scenario (same grouped aggregate 
result). Wall time from `time` is low-precision on macOS (e.g. all runs may 
round to the same hundredth of a second, so stddev can be 0).
   
   Thank you for the detailed testing! The test results look excellent to me. I 
have no further suggestions at this time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

Reply via email to