alamb opened a new issue, #16955: URL: https://github.com/apache/datafusion/issues/16955
### Is your feature request related to a problem or challenge? We recently rolled out DataFusion 48.0.1 to production and had several customers report that queries that previously ran without issue We traced it back to this PR from @Garamda which updated DataFusion SQL to use the standard WITHIN GROUP syntax: - https://github.com/apache/datafusion/pull/13511 For example, given this data ```sql > create table test(c1 int, c2 float) as values (1, 100), (2, 200), (5, 500), (4, 400), (3, 300), (2, 200); 0 row(s) fetched. > select * from test; +----+-------+ | c1 | c2 | +----+-------+ | 1 | 100.0 | | 2 | 200.0 | | 5 | 500.0 | | 4 | 400.0 | | 3 | 300.0 | | 2 | 200.0 | +----+-------+ 6 row(s) fetched. ``` ## DataFusion 47.0.0 ```sql > SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test; +-------------------------------------------------------------------+ | approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) | +-------------------------------------------------------------------+ | 5.0 | +-------------------------------------------------------------------+ 1 row(s) fetched. Elapsed 0.001 seconds. ``` ## Same query errors DataFusion 48.0.0 ```sql > SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test; Error during planning: WITHIN GROUP clause is required when calling ordered set aggregate function(approx_percentile_cont_with_weight) ``` Instead you must use a different syntax ```sql > SELECT approx_percentile_cont_with_weight(c2, 0.95) WITHIN GROUP (ORDER BY c1) FROM test; +-------------------------------------------------------------------------------------------------+ | approx_percentile_cont_with_weight(test.c2,Float64(0.95)) WITHIN GROUP [test.c1 ASC NULLS LAST] | +-------------------------------------------------------------------------------------------------+ | 5.0 | +-------------------------------------------------------------------------------------------------+ 1 row(s) fetched. Elapsed 0.003 seconds. ``` The functions affected are: - approx_percentile_cont - 47.0.0 : `approx_percentile_cont(expression, percentile, centroids)` - 48.0.0 : `approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)` - approx_percentile_cont_with_weight - 47.0.0 : `approx_percentile_cont_with_weight(expression, weight, percentile)` - 48.0.0 : `approx_percentile_cont_with_weight(weight, percentile) WITHIN GROUP (ORDER BY expression)` # Other users have been affected too Looking at other issues linked to that PR, this change seems to have caused downstream breakage for others too: - Pydantic Logfire: https://github.com/pydantic/logfire/issues/1044 - Clickbench extended queries: https://github.com/apache/datafusion/issues/15927 ### Describe the solution you'd like I would like DataFusion to support both the new and old syntax for `approx_percentile_cont` and `approx_percentile_cont_with_weight` For example, given this table: ```shell > create table test(c1 int, c2 float) as values (1, 100), (2, 200), (5, 500), (4, 400), (3, 300), (2, 200); For example, that means in addition to supporting the new syntax from DataFusion 48.0.0: ```sql > SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test; +-------------------------------------------------------------------+ | approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) | +-------------------------------------------------------------------+ | 5.0 | +-------------------------------------------------------------------+ 1 row(s) fetched. Elapsed 0.001 seconds. ``` A user should also be able to write this (as they were able in DataFusion 47) ```sql > SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM test; +-------------------------------------------------------------------+ | approx_percentile_cont_with_weight(test.c1,test.c2,Float64(0.95)) | +-------------------------------------------------------------------+ | 5.0 | +-------------------------------------------------------------------+ 1 row(s) fetched. ``` And get the same answer ### Describe alternatives you've considered We could only make this change on our own fork ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org