alamb commented on code in PR #12438:
URL: https://github.com/apache/datafusion/pull/12438#discussion_r1757338949
##########
benchmarks/queries/clickbench/extended.sql:
##########
@@ -2,3 +2,4 @@ SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT
"MobilePhone"), COUNT(DIST
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"),
COUNT(DISTINCT "BrowserLanguage") FROM hits;
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT
"HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction")
FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
SELECT "SocialSourceNetworkID", "RegionID", COUNT(*), AVG("Age"),
AVG("ParamPrice"), STDDEV("ParamPrice") as s, VAR("ParamPrice") FROM hits
GROUP BY "SocialSourceNetworkID", "RegionID" HAVING s IS NOT NULL ORDER BY s
DESC LIMIT 10;
+SELECT MIN("ResponseStartTiming") tmin, MEDIAN("ResponseStartTiming") tmed,
approx_percentile_cont("ResponseStartTiming", 0.95) tp95,
approx_percentile_cont("ResponseStartTiming", 0.95) tp99,
MAX("ResponseStartTiming") tmax, "UserID" FROM hits GROUP BY "UserID" HAVING
tmin > 0 AND tmed > 0 ORDER BY tp95 DESC LIMIT 10;
Review Comment:
When I changed it to `GROUP BY "UserID", "WatchID", "ClientIP"`, I now see
skipped aggregation rows in the partial aggregate:
skipped_aggregation_rows=98293561
> AggregateExec: mode=Partial, gby=[UserID@2 as UserID, WatchID@0 as
WatchID, ClientIP@1 as ClientIP], aggr=[min(hits.parquet.ResponseStartTiming),
median(hits.parquet.ResponseStartTiming),
approx_percentile_cont(hits.parquet.ResponseStartTiming,Float64(0.95)),
max(hits.parquet.ResponseStartTiming)], metrics=[output_rows=99997497,
elapsed_compute=246.594546516s, skipped_aggregation_rows=98293561]
```sql
> EXPLAIN ANALYZE SELECT MIN("ResponseStartTiming") tmin,
MEDIAN("ResponseStartTiming") tmed,
approx_percentile_cont("ResponseStartTiming", 0.95) tp95,
approx_percentile_cont("ResponseStartTiming", 0.95) tp99,
MAX("ResponseStartTiming") tmax, "UserID", "WatchID", "ClientIP" FROM
'hits.parquet' GROUP BY "UserID","WatchID", "ClientIP" HAVING tmin > 0 AND tmed
> 0 ORDER BY tp95 DESC LIMIT 10;
```
With that change, the query no longer finishes on my laptop (I killed it
after it had used 100GB of memory), but with
https://github.com/apache/datafusion/pull/11827 it finishes in 45 seconds.
I will think about how to modify this query to reflect the benefit without
making it impossible to run on low-resource machines.
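
To put the metrics quoted above in proportion, here is a rough Python sketch that pulls the integer counters out of the `metrics=[...]` text and computes the share of rows that skipped partial aggregation. The helper name `parse_int_metrics` is made up for illustration, and treating `output_rows` as comparable to the number of input rows is my assumption for this plan, not something the metrics state directly.

```python
import re

# Metrics text copied from the AggregateExec line quoted in this comment.
METRICS = (
    "metrics=[output_rows=99997497, elapsed_compute=246.594546516s, "
    "skipped_aggregation_rows=98293561]"
)


def parse_int_metrics(text):
    """Hypothetical helper: return the integer-valued name=value pairs.

    Values such as elapsed_compute=246.594546516s are skipped because the
    digits must be followed directly by ',' or ']'.
    """
    pairs = re.findall(r"(\w+)=(\d+)(?=[,\]])", text)
    return {name: int(value) for name, value in pairs}


metrics = parse_int_metrics(METRICS)
skipped = metrics["skipped_aggregation_rows"]
# Assumption: output_rows stands in for the number of rows the partial
# aggregate processed, which seems plausible when nearly every row is its
# own group.
total = metrics["output_rows"]
skipped_ratio = skipped / total
aggregated_rows = total - skipped

print(f"skipped share: {skipped_ratio:.1%}, aggregated before skipping: {aggregated_rows}")
```

Under that assumption, roughly 98% of the rows bypassed the partial aggregation, which is consistent with the three-column grouping key being close to unique per row.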
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]