Re: [PR] Add "Extended Clickbench" benchmark for median and approx_median for high cardinality aggregates [datafusion]

via GitHub Thu, 12 Sep 2024 11:03:24 -0700


alamb commented on code in PR #12438:
URL: https://github.com/apache/datafusion/pull/12438#discussion_r1757338949



##########
benchmarks/queries/clickbench/extended.sql:
##########
@@ -2,3 +2,4 @@ SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT 
"MobilePhone"), COUNT(DIST
 SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), 
COUNT(DISTINCT "BrowserLanguage")  FROM hits;
 SELECT "BrowserCountry",  COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT 
"HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction") 
FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
 SELECT "SocialSourceNetworkID", "RegionID", COUNT(*), AVG("Age"), 
AVG("ParamPrice"), STDDEV("ParamPrice") as s, VAR("ParamPrice")  FROM hits 
GROUP BY "SocialSourceNetworkID", "RegionID" HAVING s IS NOT NULL ORDER BY s 
DESC LIMIT 10;
+SELECT MIN("ResponseStartTiming") tmin, MEDIAN("ResponseStartTiming") tmed, 
approx_percentile_cont("ResponseStartTiming", 0.95) tp95, 
approx_percentile_cont("ResponseStartTiming", 0.95) tp99, 
MAX("ResponseStartTiming") tmax,  "UserID" FROM hits GROUP BY "UserID" HAVING 
tmin > 0 AND tmed > 0 ORDER BY tp95 DESC LIMIT 10;

Review Comment:
   When I changed it to `GROUP BY "UserID","WatchID", "ClientIP"` I see the 
skipped aggregation rows now:
   
   skipped_aggregation_rows=98293561
   
   > AggregateExec: mode=Partial, gby=[UserID@2 as UserID, WatchID@0 as 
WatchID, ClientIP@1 as ClientIP], aggr=[min(hits.parquet.ResponseStartTiming), 
median(hits.parquet.ResponseStartTiming), 
approx_percentile_cont(hits.parquet.ResponseStartTiming,Float64(0.95)), 
max(hits.parquet.ResponseStartTiming)], metrics=[output_rows=99997497, 
elapsed_compute=246.594546516s, skipped_aggregation_rows=98293561]
   
   ```sql
   > EXPLAIN ANALYZE
   SELECT MIN("ResponseStartTiming") tmin,
          MEDIAN("ResponseStartTiming") tmed,
          approx_percentile_cont("ResponseStartTiming", 0.95) tp95,
          approx_percentile_cont("ResponseStartTiming", 0.95) tp99,
          MAX("ResponseStartTiming") tmax,
          "UserID", "WatchID", "ClientIP"
   FROM 'hits.parquet'
   GROUP BY "UserID", "WatchID", "ClientIP"
   HAVING tmin > 0 AND tmed > 0
   ORDER BY tp95 DESC
   LIMIT 10;
   ```
   
   
   With that change, the query no longer finishes on my laptop (I killed it
   after it had used 100GB of memory), but with
   https://github.com/apache/datafusion/pull/11827 it finishes in 45 seconds.
   
   I will think about how to modify this query to reflect the benefit without
   making it impossible to run on low-resource machines.
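
For readers unfamiliar with the `skipped_aggregation_rows` metric quoted above, here is a toy Python sketch of the general idea, assuming the metric counts rows that the partial aggregation stage forwarded without grouping once it judged the group cardinality too high. The class name, both thresholds, and the decide-once rule are hypothetical and chosen for illustration; this is not DataFusion's actual implementation or its configuration values.

```python
from dataclasses import dataclass, field


@dataclass
class PartialAggSketch:
    """Toy model of a partial aggregation that may stop grouping (hypothetical)."""
    probe_rows_threshold: int = 4        # hypothetical: rows to observe before deciding
    probe_ratio_threshold: float = 0.8   # hypothetical: groups/rows ratio that triggers skipping
    seen_rows: int = 0
    groups: set = field(default_factory=set)
    decided: bool = False
    skipping: bool = False
    skipped_rows: int = 0

    def push_batch(self, keys):
        """Consume one batch of group keys."""
        if self.skipping:
            # Rows are passed through ungrouped; only count them.
            self.skipped_rows += len(keys)
            return
        self.seen_rows += len(keys)
        self.groups.update(keys)
        # Decide once, after enough rows have been observed.
        if not self.decided and self.seen_rows >= self.probe_rows_threshold:
            self.decided = True
            ratio = len(self.groups) / self.seen_rows
            self.skipping = ratio >= self.probe_ratio_threshold


def run(batches):
    """Feed every batch to a fresh sketch and return it."""
    sketch = PartialAggSketch()
    for batch in batches:
        sketch.push_batch(batch)
    return sketch


# Few distinct keys: grouping pays off, so nothing is skipped.
low = run([[1, 1, 2, 2], [1, 2, 1, 2]])
# Nearly every key is distinct: the second batch is passed through.
high = run([[1, 2, 3, 4], [5, 6, 7, 8]])
print(low.skipped_rows, high.skipped_rows)
```

In this model, a grouping key with few repeats behaves like the `high` case, where most rows bypass the partial stage, which matches the direction of the metric reported in the plan above.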



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


