yyin-dev commented on PR #10713: URL: https://github.com/apache/datafusion/pull/10713#issuecomment-2143594191
> > @jayzhan211 I'm working on a change, but can you help me understand the semantics here:
> >
> > ```
> > # csv_query_distinct_variance
> > query R
> > SELECT var(distinct c2) FROM aggregate_test_100
> > ----
> > 2.5
> >
> > statement error DataFusion error: This feature is not implemented: VAR\(DISTINCT\) aggregations are not available
> > SELECT var(c2), var(distinct c2) FROM aggregate_test_100
> > ```
> >
> > Why should the first query succeed but not the second one? Feel free to point me to any SQL / datafusion doc.
>
> I think it is because of the optimizer rule `SingleDistinctToGroupBy`. This rule converts distinct to group by, so the first query is no longer `distinct`. You can try adding `explain` to see the optimized logical plan.

I'm thinking about the right way to implement error-raising. Before migration, the logic was implemented in `physical-expr/src/aggregate/build_in.rs:create_aggregate_expr` as a match statement. After migration, the error should probably be raised in `physical-expr-common/src/aggregate/mod.rs:create_aggregate_expr`. There are two options:

1. Get the UDAF's name and implement similar logic based on it. This is simpler but arguably less principled.
2. Add a `support_distinct` method to the `AggregateUDFImpl` trait. This feels like a better solution.

What do you think?
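
To make option 2 concrete, here is a rough, self-contained sketch of the idea. It does not use the real DataFusion types: `ToyAggregateUdf`, `PlanError`, `Variance`, `Count`, and `check_distinct` are all hypothetical stand-ins, and the default of `false` for `support_distinct` is my assumption, not something decided in this PR.

```rust
/// Hypothetical stand-in for a "not implemented" error; not DataFusion's error type.
#[derive(Debug, PartialEq)]
pub enum PlanError {
    NotImplemented(String),
}

/// Toy trait standing in for `AggregateUDFImpl` (assumption: the real trait
/// would gain a method shaped roughly like `support_distinct`).
pub trait ToyAggregateUdf {
    fn name(&self) -> &str;

    /// Whether this aggregate can evaluate `DISTINCT` input.
    /// Defaulting to `false` means a UDAF has to opt in explicitly.
    fn support_distinct(&self) -> bool {
        false
    }
}

/// Toy variance aggregate: relies on the default, so DISTINCT is rejected.
pub struct Variance;

impl ToyAggregateUdf for Variance {
    fn name(&self) -> &str {
        "var"
    }
}

/// Toy count aggregate: opts in to DISTINCT.
pub struct Count;

impl ToyAggregateUdf for Count {
    fn name(&self) -> &str {
        "count"
    }

    fn support_distinct(&self) -> bool {
        true
    }
}

/// Toy stand-in for the check that would live in `create_aggregate_expr`:
/// reject a DISTINCT aggregate when the UDAF does not declare support for it.
pub fn check_distinct(udaf: &dyn ToyAggregateUdf, is_distinct: bool) -> Result<(), PlanError> {
    if is_distinct && !udaf.support_distinct() {
        return Err(PlanError::NotImplemented(format!(
            "{}(DISTINCT) aggregations are not available",
            udaf.name().to_uppercase()
        )));
    }
    Ok(())
}
```

With this shape, `check_distinct(&Variance, true)` returns the "not available" error while `check_distinct(&Count, true)` passes, and no name-based match statement is needed at the call site. Option 1 would instead match on `udaf.name()` inside `check_distinct`, which keeps the trait unchanged but has to be updated for every new UDAF.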