This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 1012967  [SPARK-35767][SQL] Avoid executing child plan twice in CoalesceExec
1012967 is described below
commit 1012967ace4c7bd4e5a6f59c6ea6eec45871f292
Author: Andy Grove <[email protected]>
AuthorDate: Tue Jun 15 11:59:21 2021 -0700
[SPARK-35767][SQL] Avoid executing child plan twice in CoalesceExec
### What changes were proposed in this pull request?
`CoalesceExec` needlessly calls `child.execute` twice when it could call it once and reuse the result. This only happens when `numPartitions == 1`.
### Why are the changes needed?
It is more efficient to execute the child plan once rather than twice.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
There are no functional changes. This is just a performance optimization, so the existing tests should be sufficient to catch any regressions.
Closes #32920 from andygrove/coalesce-exec-executes-twice.
Authored-by: Andy Grove <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
.../org/apache/spark/sql/execution/basicPhysicalOperators.scala | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
index b537040..8c51cde 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
@@ -724,12 +724,13 @@ case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecNode
   }
 
   protected override def doExecute(): RDD[InternalRow] = {
-    if (numPartitions == 1 && child.execute().getNumPartitions < 1) {
+    val rdd = child.execute()
+    if (numPartitions == 1 && rdd.getNumPartitions < 1) {
       // Make sure we don't output an RDD with 0 partitions, when claiming that we have a
       // `SinglePartition`.
       new CoalesceExec.EmptyRDDWithPartitions(sparkContext, numPartitions)
     } else {
-      child.execute().coalesce(numPartitions, shuffle = false)
+      rdd.coalesce(numPartitions, shuffle = false)
     }
   }
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]