ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4110284239

   Following up with additional real-data benchmark results for 
`MultiDistinctCountRewrite` (config-gated by 
`datafusion.optimizer.enable_multi_distinct_count_rewrite`, default `false`).
   
   ## Dataset generation / setup
   
   I used local TPCH SF1 parquet generated with:
   
   ```bash
   cargo install tpchgen-cli
   tpchgen-cli --scale-factor 1 --format parquet --parquet-compression 
'ZSTD(1)' --parts 1 --output-dir benchmarks/data/tpch_sf1
   ```
   
   Data path used in all queries:
   `benchmarks/data/tpch_sf1/lineitem/*.parquet`
   
   For timing/memory, each run used:
   
   ```bash
   /usr/bin/time -l datafusion-cli -q --format csv -c "SET 
datafusion.optimizer.enable_multi_distinct_count_rewrite = <true|false>;" -c 
"<QUERY>"
   ```
   
   (Reporting wall time + max RSS from `time -l`.)
   
   ## Queries tested
   
   ### Scenario 1: 3 distincts (grouped)
   
   ```sql
   SELECT SUM(d1) AS s1, SUM(d2) AS s2, SUM(d3) AS s3
   FROM (
     SELECT l_suppkey,
            COUNT(DISTINCT l_orderkey) AS d1,
            COUNT(DISTINCT l_partkey) AS d2,
            COUNT(DISTINCT l_linenumber) AS d3
     FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
     GROUP BY l_suppkey
   );
   ```
   
   ### Scenario 2: CAST-based distincts
   
   ```sql
   SELECT SUM(cx) AS sx, SUM(cy) AS sy
   FROM (
     SELECT l_suppkey,
            COUNT(DISTINCT CAST(l_extendedprice AS INT)) AS cx,
            COUNT(DISTINCT CAST(l_discount * 100 AS INT)) AS cy
     FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
     GROUP BY l_suppkey
   );
   ```
   
   ### Scenario 3: lower() + string distinct
   
   ```sql
   SELECT SUM(c1) AS s1, SUM(c2) AS s2
   FROM (
     SELECT l_returnflag,
            COUNT(DISTINCT lower(l_shipmode)) AS c1,
            COUNT(DISTINCT l_shipinstruct) AS c2
     FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
     GROUP BY l_returnflag
   );
   ```
   
   ## Results (latency + max RSS)
   
   | Scenario | Rewrite OFF | Rewrite ON | Outcome |
   |---|---:|---:|---|
   | 1) three distincts | 0.28s, ~570 MB | 0.12s, ~730 MB | Faster, higher 
memory |
   | 2) cast-based distincts | 0.17s, ~349 MB | 0.16s, ~350 MB | Near tie |
   | 3) lower()+string distinct | 0.04s, ~99 MB | 0.06s, ~94 MB | Slightly 
slower, slightly lower memory |
   
   All paired runs returned identical query results.
   
   ## Takeaway
   
   The rewrite can improve latency in some multi-distinct grouped cases, but 
can also increase peak memory and/or be neutral/slightly slower depending on 
cardinality and expression shape. Keeping this rule opt-in (default off) 
remains appropriate while we expand benchmark coverage.
   
   
   
   > Yes, I can see it would improve a lot of queries
   > 
   > > Thanks @Dandandan @xiedeyantu for the discussion.
   > > **Join cost / default behavior:** Agreed that joining on grouping keys 
isn’t free. We’ve added 
**`datafusion.optimizer.enable_multi_distinct_count_rewrite`** with default 
**`false`**, so the rewrite is **opt-in** until we have benchmarks that justify 
turning it on by default. Sessions can enable it explicitly when they want to 
try the plan shape.
   > > **Benchmarks:** We’re aligned that we should measure **latency and 
memory** vs baseline across scenarios (e.g. multiple `COUNT(DISTINCT …)` with 
`GROUP BY`, varying group-key cardinality vs distinct cardinality). We’ll 
follow up with numbers on the PR.
   > > **GroupsAccumulator / execution-layer improvements:** That work is 
complementary: better distinct accumulators help **how** each aggregate runs; 
this rule changes **logical plan shape** when several large distincts share one 
aggregate. Both can coexist on the roadmap.
   > > **Tests:** We added optimizer + SQL integration coverage (including 
cases with `lower`/`CAST` inside `COUNT(DISTINCT …)` and a test that the rule 
is a no-op when the config is off). Happy to iterate on naming and placement 
with maintainers.
   > > > > I think a rewrite like this might be useful, but I think it can also 
hurt performance because of the join on grouping keys. So I think it needs to 
have a config value (off by default) or when enabled some benchmarks showing 
that it is better in large majority of the cases.
   > > > > I am also wondering if mostly for memory usage a `GroupsAccumulator` 
for distinct count / sum might give similar/more improvements.
   > > > 
   > > > 
   > > > @Dandandan Thank you for the explanation. It’s true that this would 
add a hash join, but if aggregation can be performed in parallel, there might 
be advantages in scenarios with two or more COUNT(DISTINCT) operations. I agree 
to run performance tests across multiple scenarios to evaluate the actual 
results.
   > 
   > Sounds good - agreed.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to