alamb opened a new issue, #15927:
URL: https://github.com/apache/datafusion/issues/15927

   ### Describe the bug
   
   
https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench#extended-queries
   
   > The "extended" queries are not part of the official ClickBench benchmark. 
Instead they are used to test other DataFusion features that are not covered by 
the standard benchmark. 
   
   Recently I tried to run Q5 for benchmarking and I got an error:
   
   > Error during planning: WITHIN GROUP clause is required when calling 
ordered set aggregate function(approx_percentile_cont)
   
   
   
   ### To Reproduce
   
   Run this query in `datafusion-cli`:
   
   ```sql
   SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") tmin, 
APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, 
MAX("ResponseStartTiming") tmax
   FROM 'hits.parquet'
   WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
   GROUP BY  "ClientIP", "WatchID"
   HAVING c > 1
   ORDER BY tp95 DESC
   LIMIT 10;
   ```
   
   For example, with a locally built `datafusion-cli`:
   ```
   (venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ 
~/Software/datafusion/target/debug/datafusion-cli
   DataFusion CLI v47.0.0
   > SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") 
tmin, APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, 
MAX("ResponseStartTiming") tmax
   FROM 'hits.parquet'
   WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
   GROUP BY  "ClientIP", "WatchID"
   HAVING c > 1
   ORDER BY tp95 DESC
   LIMIT 10;
   Error during planning: WITHIN GROUP clause is required when calling ordered 
set aggregate function(approx_percentile_cont)
   ```
   
   ### Expected behavior
   
   The DataFusion 47.0.0 `datafusion-cli` binary runs the query without error:
   
   ```
   (venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ 
~/Software/datafusion-cli/datafusion-cli-47.0.0
   DataFusion CLI v47.0.0
   > SELECT "ClientIP", "WatchID",  COUNT(*) c, MIN("ResponseStartTiming") 
tmin, APPROX_PERCENTILE_CONT("ResponseStartTiming", 0.95) tp95, 
MAX("ResponseStartTiming") tmax
   FROM 'hits.parquet'
   WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
   GROUP BY  "ClientIP", "WatchID"
   HAVING c > 1
   ORDER BY tp95 DESC
   LIMIT 10;
   +-------------+---------------------+---+------+------+------+
   | ClientIP    | WatchID             | c | tmin | tp95 | tmax |
   +-------------+---------------------+---+------+------+------+
   | 1611957945  | 6655575552203051303 | 2 | 0    | 0    | 0    |
   | -1402644643 | 8566928176839891583 | 2 | 0    | 0    | 0    |
   +-------------+---------------------+---+------+------+------+
   2 row(s) fetched.
   Elapsed 5.360 seconds.
   ```
   
   
   ### Additional context
   
   
   However, it seems that Postgres also rejects such queries without `WITHIN GROUP`:
   
   See this dbfiddle: https://www.db-fiddle.com/f/5dwiFr16TvBF8zF6f2TmSz/0
   
   
![Image](https://github.com/user-attachments/assets/d42e703a-769b-4cfb-b48a-a65aa79bb7cf)
   
   So I believe the right fix is to update the benchmark query to use the `WITHIN GROUP` clause.
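
   A sketch of what the updated query might look like. It assumes the `WITHIN GROUP` form named in the error message takes the percentile as the function argument and the measured column in the `ORDER BY`, as Postgres does for `percentile_cont`; the exact form DataFusion accepts should be checked against its documentation before changing the benchmark.

   ```sql
   -- Sketch only: the argument order inside APPROX_PERCENTILE_CONT is an assumption
   -- modeled on the Postgres ordered-set aggregate syntax.
   SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin,
          APPROX_PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY "ResponseStartTiming") tp95,
          MAX("ResponseStartTiming") tmax
   FROM 'hits.parquet'
   WHERE "JavaEnable" = 0 -- filters to 32M of 100M rows
   GROUP BY "ClientIP", "WatchID"
   HAVING c > 1
   ORDER BY tp95 DESC
   LIMIT 10;
   ```

   Only the `tp95` expression changes; the rest of the query is the same as in the reproduction above.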


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

