goldmedal commented on issue #12013:
URL: https://github.com/apache/datafusion/issues/12013#issuecomment-2294902506

   > I guess this issue is similar to #11897 #11748
   
   I didn't dig into this issue but may be related 🤔 
   
   After some research, I believe this is a bug of #11681. When planning an 
aggregation, we perform some validations at 
https://github.com/apache/datafusion/blob/cb1e3f0e5cfd04195ae0c933b74f3eb3fcbe1b45/datafusion/sql/src/select.rs#L752-L770
   
   However, after moving the wildcard expansion to the analyzer, we can no 
longer determine the projection expressions at this point. This causes the 
validations to fail, resulting in an invalid plan. Essentially, it causes 
another bug. 
   
   Consider the following case:
   ```
   create table t1(v1 int, v2 int);
   insert into t1 values (1,2),(2,4),(3,6);
   select * from t1 group by v1, v2 having max(v1)=3;
   ----
   +----+----+------------+
   | v1 | v2 | max(t1.v1) |
   +----+----+------------+
   | 3  | 6  | 3          |
   +----+----+------------+
   ```
   The aggregation field for the `having` clause shouldn't be printed. This 
issue relates to how wildcards are expanded based on a filter plan.
   
   I attempted to resolve this issue roughly but encountered another problem 
with error messages. Consider the SQL provided by @2010YOUY01:
   ```
   select * from t1 having max(v1)=3;
   ---
   Error: expand_wildcard_rule
   caused by
   Schema error: No field named t1.v1. Valid fields are "max(t1.v1)".
   ```
   In the previous version (41.0.0), the message would have been:
   ```
   > select * from t1 having max(v1)=3;
   Error during planning: Projection references non-aggregate values: 
Expression t1.v1 could not be resolved from available columns: max(t1.v1)
   ```
   
   I think we can expand the wildcard when planning the aggregation for 
validation if the group-by keys are empty. I can ensure that if the group-by 
keys are empty, the SQL isn't valid for the `having` clause, resulting in a 
better error message.
   
   However, if the group-by keys aren't empty and we don't expand the wildcard, 
it becomes difficult to provide a correct error message. The SQL will fail when 
invoking `ExpandWildCardRule`. For example:
   ```
   create table t1(v1 int, v2 int);
   insert into t1 values (1, 2),(2,3),(3,4);
   select * from t1 group by v1 having max(v1)=3
   ----
   Error: expand_wildcard_rule
   caused by
   Schema error: No field named t1.v2. Valid fields are t1.v1, "max(t1.v1)".
   ```
   In the previous version (41.0.0), the error message would have been:
   ```
   > select * from t2 group by v1 having max(v2) = 2;
   Error during planning: Projection references non-aggregate values: 
Expression t2.v2 could not be resolved from available columns: t2.v1, max(t2.v2)
   ```
   
   To avoid duplicate expansion, I don't prefer expanding the wildcard when the 
group-by keys aren't empty. However, this results in an unclear error message 
for the user 🤔. 
   
   I'll draft a PR to explain this more clearly.
   
   cc @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to