ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4110284239
Following up with additional real-data benchmark results for
`MultiDistinctCountRewrite` (config-gated by
`datafusion.optimizer.enable_multi_distinct_count_rewrite`, default `false`).
## Dataset generation / setup
I used local TPCH SF1 parquet generated with:
```bash
cargo install tpchgen-cli
tpchgen-cli --scale-factor 1 --format parquet --parquet-compression
'ZSTD(1)' --parts 1 --output-dir benchmarks/data/tpch_sf1
```
Data path used in all queries:
`benchmarks/data/tpch_sf1/lineitem/*.parquet`
For timing/memory, each run used:
```bash
/usr/bin/time -l datafusion-cli -q --format csv -c "SET
datafusion.optimizer.enable_multi_distinct_count_rewrite = <true|false>;" -c
"<QUERY>"
```
(Reporting wall time + max RSS from `time -l`.)
## Queries tested
### Scenario 1: 3 distincts (grouped)
```sql
SELECT SUM(d1) AS s1, SUM(d2) AS s2, SUM(d3) AS s3
FROM (
SELECT l_suppkey,
COUNT(DISTINCT l_orderkey) AS d1,
COUNT(DISTINCT l_partkey) AS d2,
COUNT(DISTINCT l_linenumber) AS d3
FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
GROUP BY l_suppkey
);
```
### Scenario 2: CAST-based distincts
```sql
SELECT SUM(cx) AS sx, SUM(cy) AS sy
FROM (
SELECT l_suppkey,
COUNT(DISTINCT CAST(l_extendedprice AS INT)) AS cx,
COUNT(DISTINCT CAST(l_discount * 100 AS INT)) AS cy
FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
GROUP BY l_suppkey
);
```
### Scenario 3: lower() + string distinct
```sql
SELECT SUM(c1) AS s1, SUM(c2) AS s2
FROM (
SELECT l_returnflag,
COUNT(DISTINCT lower(l_shipmode)) AS c1,
COUNT(DISTINCT l_shipinstruct) AS c2
FROM 'benchmarks/data/tpch_sf1/lineitem/*.parquet'
GROUP BY l_returnflag
);
```
## Results (latency + max RSS)
| Scenario | Rewrite OFF | Rewrite ON | Outcome |
|---|---:|---:|---|
| 1) three distincts | 0.28s, ~570 MB | 0.12s, ~730 MB | Faster, higher
memory |
| 2) cast-based distincts | 0.17s, ~349 MB | 0.16s, ~350 MB | Near tie |
| 3) lower()+string distinct | 0.04s, ~99 MB | 0.06s, ~94 MB | Slightly
slower, slightly lower memory |
All paired runs returned identical query results.
## Takeaway
The rewrite can improve latency in some multi-distinct grouped cases, but
can also increase peak memory and/or be neutral/slightly slower depending on
cardinality and expression shape. Keeping this rule opt-in (default off)
remains appropriate while we expand benchmark coverage.
> Yes, I can see it would improve a lot of queries
>
> > Thanks @Dandandan @xiedeyantu for the discussion.
> > **Join cost / default behavior:** Agreed that joining on grouping keys
isn’t free. We’ve added
**`datafusion.optimizer.enable_multi_distinct_count_rewrite`** with default
**`false`**, so the rewrite is **opt-in** until we have benchmarks that justify
turning it on by default. Sessions can enable it explicitly when they want to
try the plan shape.
> > **Benchmarks:** We’re aligned that we should measure **latency and
memory** vs baseline across scenarios (e.g. multiple `COUNT(DISTINCT …)` with
`GROUP BY`, varying group-key cardinality vs distinct cardinality). We’ll
follow up with numbers on the PR.
> > **GroupsAccumulator / execution-layer improvements:** That work is
complementary: better distinct accumulators help **how** each aggregate runs;
this rule changes **logical plan shape** when several large distincts share one
aggregate. Both can coexist on the roadmap.
> > **Tests:** We added optimizer + SQL integration coverage (including
cases with `lower`/`CAST` inside `COUNT(DISTINCT …)` and a test that the rule
is a no-op when the config is off). Happy to iterate on naming and placement
with maintainers.
> > > > I think a rewrite like this might be useful, but I think it can also
hurt performance because of the join on grouping keys. So I think it needs to
have a config value (off by default) or when enabled some benchmarks showing
that it is better in large majority of the cases.
> > > > I am also wondering if mostly for memory usage a `GroupsAccumulator`
for distinct count / sum might give similar/more improvements.
> > >
> > >
> > > @Dandandan Thank you for the explanation. It’s true that this would
add a hash join, but if aggregation can be performed in parallel, there might
be advantages in scenarios with two or more COUNT(DISTINCT) operations. I agree
to run performance tests across multiple scenarios to evaluate the actual
results.
>
> Sounds good - agreed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]