Allison Wang created SPARK-32216:
------------------------------------

             Summary: Remove redundant ProjectExec
                 Key: SPARK-32216
                 URL: https://issues.apache.org/jira/browse/SPARK-32216
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Allison Wang

Currently, a Spark executed plan can contain redundant `ProjectExec` nodes. For example:

After Filter:
{code:java}
== Physical Plan ==
*(1) Project [a#14L, b#15L, c#16, key#17]
+- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
   +- *(1) ColumnarToRow
      +- FileScan parquet [a#14L,b#15L,c#16,key#17]
{code}
The `Project [a#14L, b#15L, c#16, key#17]` is redundant because its output is exactly the same as the filter's output.

Before Aggregate:
{code:java}
== Physical Plan ==
*(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L])
+- Exchange hashpartitioning(key#17, 5), true, [id=#77]
   +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
      +- *(1) Project [key#17, a#14L, b#15L]
         +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
            +- *(1) ColumnarToRow
               +- FileScan parquet [a#14L,b#15L,key#17]
{code}
The `Project [key#17, a#14L, b#15L]` is redundant because a hash aggregate doesn't require its child plan's output to be in a specific order.

In general, a project is redundant when
# It has the same output attributes, in the same order, as its child's output when the ordering of these attributes is required.
# It has the same output attributes as its child's output when attribute output ordering is not required.
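
The two conditions above can be sketched on a toy plan tree. This is only an illustration of the rule as stated, not Spark's implementation: `Node`, `ORDER_INSENSITIVE`, `is_redundant_project` and `remove_redundant_projects` are hypothetical names, every node has at most one child, and a project is assumed to list plain attribute references only (no aliases or computed expressions).

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a physical plan node (not Spark's SparkPlan)."""
    name: str
    output: Tuple[str, ...]
    child: Optional["Node"] = None


# Assumption for this sketch: operators that do not depend on the column
# order of their child's output. Only the hash aggregate is named in the issue.
ORDER_INSENSITIVE = {"HashAggregate"}


def is_redundant_project(node: Node, ordering_required: bool) -> bool:
    """Apply the two conditions from the issue to a single node."""
    if node.name != "Project" or node.child is None:
        return False
    if ordering_required:
        # Condition 1: same attributes in the same order.
        return node.output == node.child.output
    # Condition 2: same attributes, order ignored.
    return sorted(node.output) == sorted(node.child.output)


def remove_redundant_projects(plan: Optional[Node],
                              ordering_required: bool = True) -> Optional[Node]:
    """Return a copy of the plan with redundant Project nodes dropped.

    The root is treated as order-sensitive by default, since its column
    order is what the caller sees.
    """
    if plan is None:
        return None
    child_ordering_required = plan.name not in ORDER_INSENSITIVE
    new_child = remove_redundant_projects(plan.child, child_ordering_required)
    node = replace(plan, child=new_child)
    if is_redundant_project(node, ordering_required):
        return new_child
    return node


def names(plan: Optional[Node]) -> List[str]:
    """Node names from root to leaf, for easy inspection."""
    result = []
    while plan is not None:
        result.append(plan.name)
        plan = plan.child
    return result


# The "Before Aggregate" example from the issue, reduced to names and outputs.
scan = Node("FileScan", ("a#14L", "b#15L", "key#17"))
c2r = Node("ColumnarToRow", scan.output, scan)
filt = Node("Filter", c2r.output, c2r)
proj = Node("Project", ("key#17", "a#14L", "b#15L"), filt)
partial = Node("HashAggregate",
               ("key#17", "sum#49L", "last#50L", "valueSet#51"), proj)
exchange = Node("Exchange", partial.output, partial)
final = Node("HashAggregate", ("sum_a#39L", "key#17", "last_b#41L"), exchange)

print(names(remove_redundant_projects(final)))
```

In this sketch the reordering `Project` under the partial aggregate is dropped, while the same project directly under an order-sensitive parent would be kept. The pass is deliberately conservative: a child is judged against its original parent, so a chain of stacked projects may not be fully collapsed in one traversal.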