pan3793 commented on PR #44352: URL: https://github.com/apache/spark/pull/44352#issuecomment-1860806255
> It seems we have an optimization that if the ORDER BY expression directly matches something from the SELECT list, we replace it with AttributReference. Can you find out where the optimization is? And when it will be triggered? Is there rule-order independent? The plan change log shows it indeed caused by `ResolveReferences`(which invokes the virtual rule `ResolveReferencesInSort`), thus I suppose this answer is valid https://github.com/apache/spark/pull/44352#discussion_r1427566432. it should be rule-order independent <details> <summary>master plan change log</summary> ``` === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))] +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedali... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.execution.datasources.FindDataSourceTable === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))] +- 'Aggregate ['b, 'c], [unresolvedalias('u... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) ! +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))] +- 'Aggregate [b#178, c#179], [unresolvedalias... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) ! +- 'Aggregate [b#178, c#179], [unresolvedalias('udf(b#178)), unresolvedalias('udf(c#179))] +- 'Aggregate [b#178, c#179],... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) ! +- 'Aggre... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) +- Aggregate [b#178, c#179], [c... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true !+- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, fals... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true !+- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, false)) = 3) +- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178,... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true !+- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) = 3) +- Filter (udf(b)#181 = 3) +- Aggregate [b#178, ... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences === !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Project [udf(b)#181, udf(c)#182] !+- Filter (udf(b)#181 = 3) +- 'Sort ['udf(b#178) ASC NULLS FIRST, 'udf(c#179) ASC NULLS FIRST], true ! +- Aggregate ... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions === !'Project [udf(b)#181, udf(c)#182] Project [udf(b)#181, udf(c)#182] !+- 'Sort ['udf(b#178) ASC NULLS FIRST, 'udf(c#179) ASC NULLS FIRST], true +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS ... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone === Project [udf(b)#181, udf(c)#182] Project [udf(b)#181, udf(c)#182] +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cast(udf(cast(c#179 as string)) as string) ASC NULLS FIRST], true +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cas... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Result of Batch Resolution === !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true Project [udf(b)#181, udf(c)#182] !+- 'UnresolvedHaving ('udf('b) = 3) +- Sort [cast(udf(cast(b#178 as string)) as int) ASC NULLS FIRST, cast(udf(cast(c#179 as string)) as string) ASC NULLS FIRST], true ! +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))] +- Filter (udf(b)#181 = 3) ! +- 'UnresolvedRelat... ``` </details> <details> <summary>this patch plan change log</summary> ``` === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))] +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedali... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.execution.datasources.FindDataSourceTable === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))] +- 'Aggregate ['b, 'c], [unresolvedalias('u... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) ! +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))] +- 'Aggregate [b#178, c#179], [unresolvedalias... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) ! +- 'Aggregate [b#178, c#179], [unresolvedalias('udf(b#178)), unresolvedalias('udf(c#179))] +- 'Aggregate [b#178, c#179],... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) ! +- 'Aggre... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.ResolveTimeZone === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true +- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf('b) = 3) +- Aggregate [b#178, c#179], [c... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true !+- 'UnresolvedHaving ('udf('b) = 3) +- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, fals... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true !+- 'UnresolvedHaving ('udf(tempresolvedcolumn(b#178, b, false)) = 3) +- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178,... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions === 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true !+- 'UnresolvedHaving (cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) = 3) +- Filter (udf(b)#181 = 3) +- Aggregate [b#178, ... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences === !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true 'Sort ['udf(tempresolvedcolumn(b#178, b, false)) ASC NULLS FIRST, 'udf(tempresolvedcolumn(c#179, c, false)) ASC NULLS FIRST], true +- Filter (udf(b)#181 = 3) ... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions === !'Sort ['udf(tempresolvedcolumn(b#178, b, false)) ASC NULLS FIRST, 'udf(tempresolvedcolumn(c#179, c, false)) ASC NULLS FIRST], true Sort [cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) ASC NULLS FIRST, cast(udf(cast(tempresolvedcolumn(c#179, c, false) as string)) as string) ASC NULLS FIRST], true +- Filter (udf(b)#181 = 3) ... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions === !Sort [cast(udf(cast(tempresolvedcolumn(b#178, b, false) as string)) as int) ASC NULLS FIRST, cast(udf(cast(tempresolvedcolumn(c#179, c, false) as string)) as string) ASC NULLS FIRST], true Sort [udf(b)#181 ASC NULLS FIRST, udf(c)#182 ASC NULLS FIRST], true +- Filter (udf(b)#181 = 3) ... org.apache.spark.sql.catalyst.rules.PlanChangeLogger: === Result of Batch Resolution === !'Sort ['udf('b) ASC NULLS FIRST, 'udf('c) ASC NULLS FIRST], true Sort [udf(b)#181 ASC NULLS FIRST, udf(c)#182 ASC NULLS FIRST], true !+- 'UnresolvedHaving ('udf('b) = 3) +- Filter (udf(b)#181 = 3) ! +- 'Aggregate ['b, 'c], [unresolvedalias('udf('b)), unresolvedalias('udf('c))] +- Aggregate [b#178, c#179], [cast(udf(cast(b#178 as string)) as int) AS udf(b)#181, cast(udf(cast(c#179 as string)) as str... ``` </details> > I think in Spark the extension is we can push down grouping expressions and aggregate functions from ORDER BY to SELECT. BTW, I think this may be out scope of this PR. The key point of this PR is, `ORDER BY` should have the same resolve result whether `HAVING` clause is present or not. ```patch SELECT xxx FROM xxx GROUP BY xxx + HAVING xxx ORDER BY xxx ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
