Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

via GitHub Mon, 18 Dec 2023 07:25:27 -0800


pan3793 commented on PR #44352:
URL: https://github.com/apache/spark/pull/44352#issuecomment-1860806255


   > It seems we have an optimization that if the ORDER BY expression directly 
matches something from the SELECT list, we replace it with AttributReference. 
Can you find out where the optimization is? And when it will be triggered? Is 
there rule-order independent?
   
   The plan change log shows it indeed caused by `ResolveReferences`(which 
invokes the virtual rule `ResolveReferencesInSort`), thus I suppose this answer 
is valid https://github.com/apache/spark/pull/44352#discussion_r1427566432. it 
should be rule-order independent
   
   <details>
   <summary>master plan change log</summary>
   
   ```
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
        'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
        +- 'UnresolvedHaving ('udf('b) = 3)
       +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), 
unresolvedalias('udf('c))]      +- 'Aggregate ['b, 'c], 
[unresolvedalias('udf('b)), unresolvedali...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.execution.datasources.FindDataSourceTable ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
                +- 'UnresolvedHaving ('udf('b) = 3)
       +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), 
unresolvedalias('udf('c))]              +- 'Aggregate ['b, 'c], 
[unresolvedalias('u...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
              'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
              +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), 
unresolvedalias('udf('c))]            +- 'Aggregate [b#178, c#179], 
[unresolvedalias...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], 
true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
                    +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggregate [b#178, c#179], [unresolvedalias('udf(b#178)), 
unresolvedalias('udf(c#179))]      +- 'Aggregate [b#178, c#179],...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                               
'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
                                                                               
+- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggre...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
                                                                        +- 
'UnresolvedHaving ('udf('b) = 3)
       +- Aggregate [b#178, c#179], [c...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf('b) = 3)                                         
                                                                        +- 
'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, fals...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, false)) = 3)        
                                                                        +- 
'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178,...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178, b, false) as 
string)) as int) = 3)                                                    +- 
Filter (udf(b)#181 = 3)
       +- Aggregate [b#178, ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
   !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        
'Project [udf(b)#181, udf(c)#182]
   !+- Filter (udf(b)#181 = 3)                                                  
                                                                        +- 
'Sort ['udf(b#178) ASC NULLS FIRST, 'udf(c#179) ASC NULLS FIRST], true
   !   +- Aggregate ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
   !'Project [udf(b)#181, udf(c)#182]                                           
                                                                                
         Project [udf(b)#181, udf(c)#182]
   !+- 'Sort ['udf(b#178) ASC NULLS FIRST, 'udf(c#179) ASC NULLS FIRST], true   
                                                                                
         +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
    Project [udf(b)#181, udf(c)#182]                                            
                                                                                
         Project [udf(b)#181, udf(c)#182]
    +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, 
cast(udf(cast(c#179 as string)) as string) ASC NULLS FIRST], true               
                   +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS 
FIRST, cas...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Result of Batch Resolution ===
   !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
        Project [udf(b)#181, udf(c)#182]
   !+- 'UnresolvedHaving ('udf('b) = 3)                                         
        +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, 
cast(udf(cast(c#179 as string)) as string) ASC NULLS FIRST], true
   !   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), 
unresolvedalias('udf('c))]      +- Filter (udf(b)#181 = 3)
   !      +- 'UnresolvedRelat...
   ```
   </details>
   
   <details>
   <summary>this patch plan change log</summary>
   
   ```
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
        'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
        +- 'UnresolvedHaving ('udf('b) = 3)
       +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), 
unresolvedalias('udf('c))]      +- 'Aggregate ['b, 'c], 
[unresolvedalias('udf('b)), unresolvedali...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.execution.datasources.FindDataSourceTable ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
                +- 'UnresolvedHaving ('udf('b) = 3)
       +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), 
unresolvedalias('udf('c))]              +- 'Aggregate ['b, 'c], 
[unresolvedalias('u...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
              'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
              +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), 
unresolvedalias('udf('c))]            +- 'Aggregate [b#178, c#179], 
[unresolvedalias...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], 
true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
                    +- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggregate [b#178, c#179], [unresolvedalias('udf(b#178)), 
unresolvedalias('udf(c#179))]      +- 'Aggregate [b#178, c#179],...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                               
'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
                                                                               
+- 'UnresolvedHaving ('udf('b) = 3)
   !   +- 'Aggre...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
    +- 'UnresolvedHaving ('udf('b) = 3)                                         
                                                                        +- 
'UnresolvedHaving ('udf('b) = 3)
       +- Aggregate [b#178, c#179], [c...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf('b) = 3)                                         
                                                                        +- 
'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, fals...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, false)) = 3)        
                                                                        +- 
'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178,...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
    'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true
   !+- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178, b, false) as 
string)) as int) = 3)                                                    +- 
Filter (udf(b)#181 = 3)
       +- Aggregate [b#178, ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
   !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
                                                                        'Sort 
['udf(tempresolvedcolumn(b#178, b, false)) ASC NULLS FIRST, 
'udf(tempresolvedcolumn(c#179, c, false)) ASC NULLS FIRST], true
    +- Filter (udf(b)#181 = 3)                                                  
                                                                  ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
   !'Sort ['udf(tempresolvedcolumn(b#178, b, false)) ASC NULLS FIRST, 
'udf(tempresolvedcolumn(c#179, c, false)) ASC NULLS FIRST], true                
  Sort [cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) 
ASC NULLS FIRST, cast(udf(cast(tempresolvedcolumn(c#179, c, false) as string)) 
as string) ASC NULLS FIRST], true
    +- Filter (udf(b)#181 = 3)                                                  
         ...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
   !Sort [cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) 
ASC NULLS FIRST, cast(udf(cast(tempresolvedcolumn(c#179, c, false) as string)) 
as string) ASC NULLS FIRST], true   Sort [udf(b)#181 ASC NULLS FIRST, 
udf(c)#182 ASC NULLS FIRST], true
    +- Filter (udf(b)#181 = 3)                                                  
                                                                              
...
   org.apache.spark.sql.catalyst.rules.PlanChangeLogger:
   === Result of Batch Resolution ===
   !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true            
        Sort [udf(b)#181 ASC NULLS FIRST, udf(c)#182 ASC NULLS FIRST], true
   !+- 'UnresolvedHaving ('udf('b) = 3)                                         
        +- Filter (udf(b)#181 = 3)
   !   +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), 
unresolvedalias('udf('c))]      +- Aggregate [b#178, c#179], 
[cast(udf(cast(b#178 as string)) as int) AS udf(b)#181, cast(udf(cast(c#179 as 
string)) as str...
   ```
   </details>
   
   > I think in Spark the extension is we can push down grouping expressions 
and aggregate functions from ORDER BY to SELECT.
   
   BTW, I think this may be out scope of this PR. The key point of this PR is, 
`ORDER BY` should have the same resolve result whether `HAVING` clause is 
present or not.
   
   ```patch
     SELECT xxx
     FROM xxx
     GROUP BY xxx
   + HAVING xxx
     ORDER BY xxx
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-28386][SQL] Cannot resolve ORDER BY columns with GROUP BY and HAVING [spark]

Reply via email to