This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion.git
The following commit(s) were added to refs/heads/main by this push:
new 416ff2d1e3 Add "Extended Clickbench" benchmark for median and
approx_median for high cardinality aggregates (#12438)
416ff2d1e3 is described below
commit 416ff2d1e39daa342be2f9f4e283acaeb66ee0dd
Author: Andrew Lamb <[email protected]>
AuthorDate: Mon Sep 16 13:56:23 2024 -0400
Add "Extended Clickbench" benchmark for median and approx_median for high
cardinality aggregates (#12438)
* Add "Extended Clickbench" benchmark for median and approx_median for high
cardinality aggregates
* Update benchmarks/queries/clickbench/extended.sql
Co-authored-by: Eduard Karacharov <[email protected]>
* fix doc
* Update queries to be something more reasonable
---------
Co-authored-by: Eduard Karacharov <[email protected]>
---
benchmarks/queries/clickbench/README.md | 58 +++++++++++++++++++++++++++++-
benchmarks/queries/clickbench/extended.sql | 2 ++
2 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/benchmarks/queries/clickbench/README.md
b/benchmarks/queries/clickbench/README.md
index c6bd8fe9d6..6797797409 100644
--- a/benchmarks/queries/clickbench/README.md
+++ b/benchmarks/queries/clickbench/README.md
@@ -60,7 +60,7 @@ LIMIT 10;
### Q3: What is the income distribution for users in specific regions
-**Question**: "What regions and social networks have the highest variance of
parameter price
+**Question**: "What regions and social networks have the highest variance of
parameter price?"
**Important Query Properties**: STDDEV and VAR aggregation functions, GROUP BY
multiple small ints
@@ -73,6 +73,62 @@ ORDER BY s DESC
LIMIT 10;
```
+### Q4: Response start time distribution analysis (median)
+
+**Question**: Find the WatchIDs with the highest median "ResponseStartTiming"
without Java enabled
+
+**Important Query Properties**: MEDIAN, functions, high cardinality grouping
that skips intermediate aggregation
+
+Note this query is somewhat synthetic as "WatchID" is almost unique (there are
a few duplicates)
+
+```sql
+SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin,
MEDIAN("ResponseStartTiming") tmed, MAX("ResponseStartTiming") tmax
+FROM 'hits.parquet'
+WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
+GROUP BY "ClientIP", "WatchID"
+HAVING c > 1
+ORDER BY tmed DESC
+LIMIT 10;
+```
+
+Results look like
+
++-------------+---------------------+---+------+------+------+
+| ClientIP | WatchID | c | tmin | tmed | tmax |
++-------------+---------------------+---+------+------+------+
+| 1611957945 | 6655575552203051303 | 2 | 0 | 0 | 0 |
+| -1402644643 | 8566928176839891583 | 2 | 0 | 0 | 0 |
++-------------+---------------------+---+------+------+------+
+
+
+### Q5: Response start time distribution analysis (p95)
+
+**Question**: Find the WatchIDs with the highest p95 "ResponseStartTiming"
without Java enabled
+
+**Important Query Properties**: APPROX_PERCENTILE_CONT, functions, high
cardinality grouping that skips intermediate aggregation
+
+Note this query is somewhat synthetic as "WatchID" is almost unique (there are
a few duplicates)
+
+```sql
+SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin,
APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95,
MAX("ResponseStartTiming") tmax
+FROM 'hits.parquet'
+WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
+GROUP BY "ClientIP", "WatchID"
+HAVING c > 1
+ORDER BY tp95 DESC
+LIMIT 10;
+```
+
+Results look like
+
++-------------+---------------------+---+------+------+------+
+| ClientIP | WatchID | c | tmin | tp95 | tmax |
++-------------+---------------------+---+------+------+------+
+| 1611957945 | 6655575552203051303 | 2 | 0 | 0 | 0 |
+| -1402644643 | 8566928176839891583 | 2 | 0 | 0 | 0 |
++-------------+---------------------+---+------+------+------+
+
+
## Data Notes
Here are some interesting statistics about the data used in the queries
diff --git a/benchmarks/queries/clickbench/extended.sql
b/benchmarks/queries/clickbench/extended.sql
index 2f814ad845..fbabaf2a70 100644
--- a/benchmarks/queries/clickbench/extended.sql
+++ b/benchmarks/queries/clickbench/extended.sql
@@ -2,3 +2,5 @@ SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT
"MobilePhone"), COUNT(DIST
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"),
COUNT(DISTINCT "BrowserLanguage") FROM hits;
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT
"HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction")
FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
SELECT "SocialSourceNetworkID", "RegionID", COUNT(*), AVG("Age"),
AVG("ParamPrice"), STDDEV("ParamPrice") as s, VAR("ParamPrice") FROM hits
GROUP BY "SocialSourceNetworkID", "RegionID" HAVING s IS NOT NULL ORDER BY s
DESC LIMIT 10;
+SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin,
MEDIAN("ResponseStartTiming") tmed, MAX("ResponseStartTiming") tmax FROM hits
WHERE "JavaEnable" = 0 GROUP BY "ClientIP", "WatchID" HAVING c > 1 ORDER BY
tmed DESC LIMIT 10;
+SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin,
APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95,
MAX("ResponseStartTiming") tmax FROM 'hits' WHERE "JavaEnable" = 0 GROUP BY
"ClientIP", "WatchID" HAVING c > 1 ORDER BY tp95 DESC LIMIT 10;
\ No newline at end of file
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]