goldmedal commented on issue #12013: URL: https://github.com/apache/datafusion/issues/12013#issuecomment-2294902506
> I guess this issue is similar to #11897 #11748 I didn't dig into this issue but may be related 🤔 After some research, I believe this is a bug of #11681. When planning an aggregation, we perform some validations at https://github.com/apache/datafusion/blob/cb1e3f0e5cfd04195ae0c933b74f3eb3fcbe1b45/datafusion/sql/src/select.rs#L752-L770 However, after moving the wildcard expansion to the analyzer, we can no longer determine the projection expressions at this point. This causes the validations to fail, resulting in an invalid plan. Essentially, it causes another bug. Consider the following case: ``` create table t1(v1 int, v2 int); insert into t1 values (1,2),(2,4),(3,6); select * from t1 group by v1, v2 having max(v1)=3; ---- +----+----+------------+ | v1 | v2 | max(t1.v1) | +----+----+------------+ | 3 | 6 | 3 | +----+----+------------+ ``` The aggregation field for the `having` clause shouldn't be printed. This issue relates to how wildcards are expanded based on a filter plan. I attempted to resolve this issue roughly but encountered another problem with error messages. Consider the SQL provided by @2010YOUY01: ``` select * from t1 having max(v1)=3; --- Error: expand_wildcard_rule caused by Schema error: No field named t1.v1. Valid fields are "max(t1.v1)". ``` In the previous version (41.0.0), the message would have been: ``` > select * from t1 having max(v1)=3; Error during planning: Projection references non-aggregate values: Expression t1.v1 could not be resolved from available columns: max(t1.v1) ``` I think we can expand the wildcard when planning the aggregation for validation if the group-by keys are empty. I can ensure that if the group-by keys are empty, the SQL isn't valid for the `having` clause, resulting in a better error message. However, if the group-by keys aren't empty and we don't expand the wildcard, it becomes difficult to provide a correct error message. The SQL will fail when invoking `ExpandWildCardRule`. For example: ``` create table t1(v1 int, v2 int); insert into t1 values (1, 2),(2,3),(3,4); select * from t1 group by v1 having max(v1)=3 ---- Error: expand_wildcard_rule caused by Schema error: No field named t1.v2. Valid fields are t1.v1, "max(t1.v1)". ``` In the previous version (41.0.0), the error message would have been: ``` > select * from t2 group by v1 having max(v2) = 2; Error during planning: Projection references non-aggregate values: Expression t2.v2 could not be resolved from available columns: t2.v1, max(t2.v2) ``` To avoid duplicate expansion, I don't prefer expanding the wildcard when the group-by keys aren't empty. However, this results in an unclear error message for the user 🤔. I'll draft a PR to explain this more clearly. cc @jayzhan211 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
