[GitHub] spark pull request #21564: [SPARK-24556][SQL] ReusedExchange should rewrite ...

yucai Thu, 14 Jun 2018 08:16:24 -0700

Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21564#discussion_r195463301
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
    @@ -170,6 +170,8 @@ case class InMemoryTableScanExec(
       override def outputPartitioning: Partitioning = {
         relation.cachedPlan.outputPartitioning match {
           case h: HashPartitioning => updateAttribute(h).asInstanceOf[HashPartitioning]
    +      case r: RangePartitioning =>
    +        r.copy(ordering = r.ordering.map(updateAttribute(_).asInstanceOf[SortOrder]))
    --- End diff --
    
    `PartitioningCollection` should also be handled here. Consider the following case:
    ```
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
    spark.conf.set("spark.sql.codegen.wholeStage", false)
    val df1 = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("i", "j").as("t1")
    val df2 = Seq(1 -> "a", 3 -> "c", 2 -> "b").toDF("m", "n").as("t2")
    val d = df1.join(df2, $"t1.i" === $"t2.m")
    d.cache
    val d1 = d.as("t3")
    val d2 = d.as("t4")
    d1.join(d2, $"t3.i" === $"t4.i").explain
    ```
    ```
    SortMergeJoin [i#5], [i#54], Inner
    :- InMemoryTableScan [i#5, j#6, m#15, n#16]
    :     +- InMemoryRelation [i#5, j#6, m#15, n#16], CachedRDDBuilder
    :           +- SortMergeJoin [i#5], [m#15], Inner
    :              :- Sort [i#5 ASC NULLS FIRST], false, 0
    :              :  +- Exchange hashpartitioning(i#5, 10)
    :              :     +- LocalTableScan [i#5, j#6]
    :              +- Sort [m#15 ASC NULLS FIRST], false, 0
    :                 +- Exchange hashpartitioning(m#15, 10)
    :                    +- LocalTableScan [m#15, n#16]
    +- Sort [i#54 ASC NULLS FIRST], false, 0
       +- Exchange hashpartitioning(i#54, 10)
          +- InMemoryTableScan [i#54, j#55, m#58, n#59]
                +- InMemoryRelation [i#54, j#55, m#58, n#59], CachedRDDBuilder
                      +- SortMergeJoin [i#5], [m#15], Inner
                         :- Sort [i#5 ASC NULLS FIRST], false, 0
                         :  +- Exchange hashpartitioning(i#5, 10)
                         :     +- LocalTableScan [i#5, j#6]
                         +- Sort [m#15 ASC NULLS FIRST], false, 0
                            +- Exchange hashpartitioning(m#15, 10)
                               +- LocalTableScan [m#15, n#16]
    ```
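    `Exchange hashpartitioning(i#54, 10)` is an extra shuffle. The cached `SortMergeJoin` reports a `PartitioningCollection` that still refers to `i#5` and `m#15`, so the second scan's partitioning on `i#54` is not recognized.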
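    
    One possible way to cover this, as a rough sketch only: rewrite attributes recursively so that every partitioning nested inside a `PartitioningCollection` is updated as well. The types below (`Attr`, the simplified `Partitioning` hierarchy, and `rewrite`) are hypothetical stand-ins for illustration, not Spark's Catalyst classes; the expression IDs are taken from the plan above.
    ```scala
    object PartitioningSketch {
      // Hypothetical stand-in for a Catalyst attribute such as i#5.
      final case class Attr(name: String, exprId: Int)

      // Simplified model of the partitioning hierarchy (assumption, not Spark's API).
      sealed trait Partitioning
      final case class HashPartitioning(keys: Seq[Attr], numPartitions: Int) extends Partitioning
      final case class RangePartitioning(ordering: Seq[Attr], numPartitions: Int) extends Partitioning
      final case class PartitioningCollection(partitionings: Seq[Partitioning]) extends Partitioning
      case object UnknownPartitioning extends Partitioning

      // Replace every attribute found in `mapping`, recursing into collections.
      def rewrite(p: Partitioning, mapping: Map[Attr, Attr]): Partitioning = p match {
        case h: HashPartitioning =>
          h.copy(keys = h.keys.map(a => mapping.getOrElse(a, a)))
        case r: RangePartitioning =>
          r.copy(ordering = r.ordering.map(a => mapping.getOrElse(a, a)))
        case PartitioningCollection(ps) =>
          PartitioningCollection(ps.map(rewrite(_, mapping)))
        case other => other
      }

      def main(args: Array[String]): Unit = {
        // Cached plan output (i#5, m#15) mapped to the second scan's output (i#54, m#58).
        val mapping = Map(Attr("i", 5) -> Attr("i", 54), Attr("m", 15) -> Attr("m", 58))
        val cached = PartitioningCollection(Seq(
          HashPartitioning(Seq(Attr("i", 5)), 10),
          HashPartitioning(Seq(Attr("m", 15)), 10)))
        println(rewrite(cached, mapping))
      }
    }
    ```
    With a rewrite along these lines, the second scan could report a hash partitioning on `i#54`, which should let the planner skip the extra exchange.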
    
    What do you think?


