[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...

mgaido91 Fri, 09 Feb 2018 06:08:22 -0800

GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/20560


    [SPARK-23375][SQL] Eliminate unneeded Sort in Optimizer

    ## What changes were proposed in this pull request?
    
    Added a new rule to remove Sort operation when its child is already sorted.
    For instance, this simple code:
    ```
    spark.sparkContext.parallelize(Seq(("a", "b"))).toDF("a", 
"b").registerTempTable("table1")
    val df = sql(s"""SELECT b
                    | FROM (
                    |     SELECT a, b
                    |     FROM table1
                    |     ORDER BY a
                    | ) t
                    | ORDER BY a""".stripMargin)
    df.explain(true)
    ```
    before the PR produces this plan:
    ```
    == Parsed Logical Plan ==
    'Sort ['a ASC NULLS FIRST], true
    +- 'Project ['b]
       +- 'SubqueryAlias t
          +- 'Sort ['a ASC NULLS FIRST], true
             +- 'Project ['a, 'b]
                +- 'UnresolvedRelation `table1`
    
    == Analyzed Logical Plan ==
    b: string
    Project [b#7]
    +- Sort [a#6 ASC NULLS FIRST], true
       +- Project [b#7, a#6]
          +- SubqueryAlias t
             +- Sort [a#6 ASC NULLS FIRST], true
                +- Project [a#6, b#7]
                   +- SubqueryAlias table1
                      +- Project [_1#3 AS a#6, _2#4 AS b#7]
                         +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS 
_1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, 
true, false) AS _2#4]
                            +- ExternalRDD [obj#2]
    
    == Optimized Logical Plan ==
    Project [b#7]
    +- Sort [a#6 ASC NULLS FIRST], true
       +- Project [b#7, a#6]
          +- Sort [a#6 ASC NULLS FIRST], true
             +- Project [_1#3 AS a#6, _2#4 AS b#7]
                +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#3, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS 
_2#4]
                   +- ExternalRDD [obj#2]
    
    == Physical Plan ==
    *(3) Project [b#7]
    +- *(3) Sort [a#6 ASC NULLS FIRST], true, 0
       +- Exchange rangepartitioning(a#6 ASC NULLS FIRST, 200)
          +- *(2) Project [b#7, a#6]
             +- *(2) Sort [a#6 ASC NULLS FIRST], true, 0
                +- Exchange rangepartitioning(a#6 ASC NULLS FIRST, 200)
                   +- *(1) Project [_1#3 AS a#6, _2#4 AS b#7]
                      +- *(1) SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#3, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS 
_2#4]
                         +- Scan ExternalRDDScan[obj#2]
    ```
    
    while after the PR produces:
    
    ```
    == Parsed Logical Plan ==
    'Sort ['a ASC NULLS FIRST], true
    +- 'Project ['b]
       +- 'SubqueryAlias t
          +- 'Sort ['a ASC NULLS FIRST], true
             +- 'Project ['a, 'b]
                +- 'UnresolvedRelation `table1`
    
    == Analyzed Logical Plan ==
    b: string
    Project [b#7]
    +- Sort [a#6 ASC NULLS FIRST], true
       +- Project [b#7, a#6]
          +- SubqueryAlias t
             +- Sort [a#6 ASC NULLS FIRST], true
                +- Project [a#6, b#7]
                   +- SubqueryAlias table1
                      +- Project [_1#3 AS a#6, _2#4 AS b#7]
                         +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS 
_1#3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, 
true, false) AS _2#4]
                            +- ExternalRDD [obj#2]
    
    == Optimized Logical Plan ==
    Project [b#7]
    +- Sort [a#6 ASC NULLS FIRST], true
       +- Project [_1#3 AS a#6, _2#4 AS b#7]
          +- SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#3, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS 
_2#4]
             +- ExternalRDD [obj#2]
    
    == Physical Plan ==
    *(2) Project [b#7]
    +- *(2) Sort [a#6 ASC NULLS FIRST], true, 0
       +- Exchange rangepartitioning(a#6 ASC NULLS FIRST, 5)
          +- *(1) Project [_1#3 AS a#6, _2#4 AS b#7]
             +- *(1) SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#3, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, assertnotnull(input[0, scala.Tuple2, true])._2, true, false) AS 
_2#4]
                +- Scan ExternalRDDScan[obj#2]
    ```
    
    this means that an unnecessary sort operation is not performed after the PR.
    
    ## How was this patch tested?
    
    added UT


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-23375

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20560.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20560
    
----
commit 550ff99652de515e9ee056596350a8cbf802f938
Author: Marco Gaido <marcogaido91@...>
Date:   2018-02-09T13:57:08Z

    [SPARK-23375][SQL] Eliminate unneeded Sort in Optimizer

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...

Reply via email to