xiedeyantu commented on PR #21088: URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4133837810
> Hi @xiedeyantu update on the benchmarking. The new benchmarks use the **grouped aggregate only** (no outer `SUM` or other wrapper) so timings reflect the multi–`COUNT(DISTINCT …)` work directly. Note: Figures below are **warm-cache** repeats on one machine (relative A/B, not a CI guarantee). Also, these figures are from running each of the query 10 times and flipping the rewrite setting on and off using the grouped queries above (no outer `SUM`). > > ## Queries tested - original runs > ### Scenario 1: three distincts (grouped) > ```sql > SELECT l_suppkey, > COUNT(DISTINCT l_orderkey) AS d1, > COUNT(DISTINCT l_partkey) AS d2, > COUNT(DISTINCT l_linenumber) AS d3 > FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet' > GROUP BY l_suppkey; > ``` > > ## Rerun: reviewer-aligned changes > **CAST (Scenario 2):** both distincts use `CAST` on a **column** only (no `*` expression inside `CAST`): > > ```sql > SELECT l_suppkey, > COUNT(DISTINCT CAST(l_extendedprice AS INT)) AS cx, > COUNT(DISTINCT CAST(l_discount AS INT)) AS cy > FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet' > GROUP BY l_suppkey; > ``` > > **High cardinality (Scenario 3):** same aggregates as before, but **`GROUP BY l_suppkey`** instead of `l_returnflag`: > > ```sql > SELECT l_suppkey, > COUNT(DISTINCT lower(l_shipmode)) AS c1, > COUNT(DISTINCT l_shipinstruct) AS c2 > FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet' > GROUP BY l_suppkey; > ``` > > ### Results (mean ± sample stddev, **n=10** per cell) > Measured with `/usr/bin/time -l` and `target/release/datafusion-cli` from the repo root. RSS is **maximum resident set size** from `time -l` (MB here are decimal megabytes: bytes / 10⁶). Queries are the **reviewer-aligned grouped aggregates above** (no outer `SUM`). > > **wall (s) mean ± stddev** > > Scenario Rewrite OFF Rewrite ON Outcome > 1) three distincts 0.2490 ± 0.0099 0.1000 ± 0.0000 Lower mean wall time with ON > 2) cast-based (column-only `CAST`) 0.1350 ± 0.0053 0.0700 ± 0.0000 Lower mean wall time with ON > 3) lower()+string (`GROUP BY l_suppkey`) 3.3920 ± 0.6496 0.0800 ± 0.0000 Much lower mean wall time with ON > **max RSS (MB) mean ± stddev** > > Scenario Rewrite OFF Rewrite ON Outcome > 1) three distincts 590.03 ± 10.12 722.62 ± 12.70 Higher mean peak RSS with ON > 2) cast-based (column-only `CAST`) 331.83 ± 4.56 638.49 ± 8.29 Higher mean peak RSS with ON > 3) lower()+string (`GROUP BY l_suppkey`) 9755.13 ± 114.94 277.66 ± 1.71 Much lower mean peak RSS with ON > Row outputs matched OFF vs ON for every scenario (same grouped aggregate result). Wall time from `time` is low-precision on macOS (e.g. all runs may round to the same hundredth of a second, so stddev can be 0). Thank you for the detailed testing! The test results look excellent to me. I have no further suggestions at this time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
