viirya commented on a change in pull request #26978: [SPARK-29721][SQL] Prune
unnecessary nested fields from Generate without Project
URL: https://github.com/apache/spark/pull/26978#discussion_r367673756
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -155,6 +155,49 @@ object NestedColumnAliasing {
case MapType(keyType, valueType, _) => totalFieldNum(keyType) + totalFieldNum(valueType)
case _ => 1 // UDT and others
}
+}
+
+/**
+ * This prunes unnessary nested columns from `Generate` and optional `Project` on top
+ * of it.
+ */
+object GeneratorNestedColumnAliasing {
+ def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
+ case Project(projectList, g: Generate) if (SQLConf.get.nestedPruningOnExpressions ||
+ SQLConf.get.nestedSchemaPruningEnabled) && canPruneGenerator(g.generator) =>
Review comment:
One reason to add `nestedSchemaPruningEnabled` here is that we cannot just
push through Generate (the next pattern case) without this Project + Generate
case. Otherwise we would produce an invalid query plan: the nested column
accessor in the top Project is not pruned through Generate, while the other
nested column, the one used by Generate, is pruned through to Generate's
child. The nested column accessor in the top Project then becomes
unresolvable, because the attribute it refers to is no longer in the output
of its child.
E.g.:
!Project [a.b, col]
+ Generate [explode(gen_alias#123), col]
+ Project [a.c as gen_alias#123]
+ Scan [a:<c:array<int>>]
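The failure can be reproduced with a toy model. This is only a sketch and does
not use Catalyst: `Plan`, `Scan`, `Project`, `Generate`, `root`, and
`unresolved` below are hypothetical stand-ins that track nothing but the
top-level attribute names each node exposes to its parent.

```scala
// Toy stand-ins for logical plan nodes; these are NOT Spark's classes.
object PruneThroughGenerateSketch {
  sealed trait Plan { def output: Set[String] } // attribute names visible to the parent

  final case class Scan(output: Set[String]) extends Plan

  // Each entry is (output name, referenced attribute or nested accessor).
  final case class Project(list: Seq[(String, String)], child: Plan) extends Plan {
    val output: Set[String] = list.map(_._1).toSet
  }

  // Passes the child's output through and appends the generator output column.
  final case class Generate(input: String, outCol: String, child: Plan) extends Plan {
    val output: Set[String] = child.output + outCol
  }

  // Root attribute of a nested accessor: "a.b" -> "a".
  def root(ref: String): String = ref.takeWhile(_ != '.')

  // References whose root attribute is not produced by the node's child.
  def unresolved(plan: Plan): Set[String] = plan match {
    case Scan(_) => Set.empty[String]
    case Project(list, child) =>
      val here: Set[String] = list.map(_._2).filterNot(r => child.output(root(r))).toSet
      here ++ unresolved(child)
    case Generate(input, _, child) =>
      val here: Set[String] =
        if (child.output(root(input))) Set.empty[String] else Set(input)
      here ++ unresolved(child)
  }

  val scan: Plan = Scan(Set("a"))

  // Before pruning: Generate reads a.c directly from the scan.
  val original: Plan =
    Project(Seq("b" -> "a.b", "col" -> "col"), Generate("a.c", "col", scan))

  // Only the generator's nested column is pushed below Generate.
  val pushedOnlyGenerate: Plan =
    Project(Seq("b" -> "a.b", "col" -> "col"),
      Generate("gen_alias", "col", Project(Seq("gen_alias" -> "a.c"), scan)))

  def main(args: Array[String]): Unit = {
    println(unresolved(original))
    println(unresolved(pushedOnlyGenerate))
  }
}
```

In this model the original plan has no unresolved references, while the plan
that pushes only the generator's column reports `a.b`, matching the `!Project`
in the example above.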