This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git


The following commit(s) were added to refs/heads/main by this push:
     new 416ff2d1e3 Add "Extended Clickbench" benchmark for median and 
approx_median for high cardinality aggregates (#12438)
416ff2d1e3 is described below

commit 416ff2d1e39daa342be2f9f4e283acaeb66ee0dd
Author: Andrew Lamb <[email protected]>
AuthorDate: Mon Sep 16 13:56:23 2024 -0400

    Add "Extended Clickbench" benchmark for median and approx_median for high 
cardinality aggregates (#12438)
    
    * Add "Extended Clickbench" benchmark for median and approx_median for high 
cardinality aggregates
    
    * Update benchmarks/queries/clickbench/extended.sql
    
    Co-authored-by: Eduard Karacharov <[email protected]>
    
    * fix doc
    
    * Update queries to be something more reasonable
    
    ---------
    
    Co-authored-by: Eduard Karacharov <[email protected]>
---
 benchmarks/queries/clickbench/README.md    | 58 +++++++++++++++++++++++++++++-
 benchmarks/queries/clickbench/extended.sql |  2 ++
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/benchmarks/queries/clickbench/README.md 
b/benchmarks/queries/clickbench/README.md
index c6bd8fe9d6..6797797409 100644
--- a/benchmarks/queries/clickbench/README.md
+++ b/benchmarks/queries/clickbench/README.md
@@ -60,7 +60,7 @@ LIMIT 10;
 
 ### Q3: What is the income distribution for users in specific regions
 
-**Question**: "What regions and social networks have the highest variance of 
parameter price 
+**Question**: "What regions and social networks have the highest variance of 
parameter price?"
 
 **Important Query Properties**: STDDEV and VAR aggregation functions, GROUP BY 
multiple small ints
 
@@ -73,6 +73,62 @@ ORDER BY s DESC
 LIMIT 10;
 ```
 
+### Q4: Response start time distribution analysis (median)
+
+**Question**:  Find the WatchIDs with the highest median "ResponseStartTiming" 
without Java enabled
+
+**Important Query Properties**: MEDIAN, functions, high cardinality grouping 
that skips intermediate aggregation
+
+Note this query is somewhat synthetic as "WatchID" is almost unique (there are 
a few duplicates)
+
+```sql
+SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") tmin, 
MEDIAN("ResponseStartTiming") tmed, MAX("ResponseStartTiming") tmax
+FROM 'hits.parquet'
+WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
+GROUP BY  "ClientIP", "WatchID"
+HAVING c > 1
+ORDER BY tmed DESC
+LIMIT 10;
+```
+
+Results look like
+
++-------------+---------------------+---+------+------+------+
+| ClientIP    | WatchID             | c | tmin | tmed | tmax |
++-------------+---------------------+---+------+------+------+
+| 1611957945  | 6655575552203051303 | 2 | 0    | 0    | 0    |
+| -1402644643 | 8566928176839891583 | 2 | 0    | 0    | 0    |
++-------------+---------------------+---+------+------+------+
+
+
+### Q5: Response start time distribution analysis (p95)
+
+**Question**:  Find the WatchIDs with the highest p95 "ResponseStartTiming" 
without Java enabled
+
+**Important Query Properties**: APPROX_PERCENTILE_CONT, functions, high 
cardinality grouping that skips intermediate aggregation
+
+Note this query is somewhat synthetic as "WatchID" is almost unique (there are 
a few duplicates)
+
+```sql
+SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") tmin, 
APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, 
MAX("ResponseStartTiming") tmax
+FROM 'hits.parquet'
+WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
+GROUP BY  "ClientIP", "WatchID"
+HAVING c > 1
+ORDER BY tp95 DESC
+LIMIT 10;
+```
+
+Results look like
+
++-------------+---------------------+---+------+------+------+
+| ClientIP    | WatchID             | c | tmin | tp95 | tmax |
++-------------+---------------------+---+------+------+------+
+| 1611957945  | 6655575552203051303 | 2 | 0    | 0    | 0    |
+| -1402644643 | 8566928176839891583 | 2 | 0    | 0    | 0    |
++-------------+---------------------+---+------+------+------+
+
+
 ## Data Notes
 
 Here are some interesting statistics about the data used in the queries
diff --git a/benchmarks/queries/clickbench/extended.sql 
b/benchmarks/queries/clickbench/extended.sql
index 2f814ad845..fbabaf2a70 100644
--- a/benchmarks/queries/clickbench/extended.sql
+++ b/benchmarks/queries/clickbench/extended.sql
@@ -2,3 +2,5 @@ SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT 
"MobilePhone"), COUNT(DIST
 SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), 
COUNT(DISTINCT "BrowserLanguage")  FROM hits;
 SELECT "BrowserCountry",  COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT 
"HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction") 
FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
 SELECT "SocialSourceNetworkID", "RegionID", COUNT(*), AVG("Age"), 
AVG("ParamPrice"), STDDEV("ParamPrice") as s, VAR("ParamPrice")  FROM hits 
GROUP BY "SocialSourceNetworkID", "RegionID" HAVING s IS NOT NULL ORDER BY s 
DESC LIMIT 10;
+SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") tmin, 
MEDIAN("ResponseStartTiming") tmed, MAX("ResponseStartTiming") tmax FROM hits 
WHERE "JavaEnable" = 0  GROUP BY  "ClientIP", "WatchID" HAVING c > 1 ORDER BY 
tmed DESC LIMIT 10;
+SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") tmin, 
APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, 
MAX("ResponseStartTiming") tmax FROM 'hits' WHERE "JavaEnable" = 0 GROUP BY  
"ClientIP", "WatchID" HAVING c > 1 ORDER BY tp95 DESC LIMIT 10;
\ No newline at end of file


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to