GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/14865

    [SPARK-17289][SQL] Fix a bug to satisfy sort requirements in partial aggregations

    ## What changes were proposed in this pull request?
    Partial aggregations are generated in `EnsureRequirements`, but the planner
    fails to check whether a partial aggregation satisfies its sort requirements.
    For the following query:
    ```
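    import spark.implicits._  // needed for toDF outside spark-shell
    // createOrReplaceTempView returns Unit, so the result is not bound to a val
    (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2")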
    spark.sql("select max(b) from t2 group by a").explain(true)
    ```
    Currently, the planner does not insert a `Sort` operator below the partial
    `SortAggregate`, so the sort-based partial aggregation reads unsorted input:
    ```
    == Physical Plan ==
    SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
    +- *Sort [a#5 ASC], false, 0
       +- Exchange hashpartitioning(a#5, 200)
          +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
             +- LocalTableScan [a#5, b#6]
    ```
    The correct plan sorts the input of the partial aggregation as well:
    ```
    == Physical Plan ==
    SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
    +- *Sort [a#5 ASC], false, 0
       +- Exchange hashpartitioning(a#5, 200)
          +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
             +- *Sort [a#5 ASC], false, 0
                +- LocalTableScan [a#5, b#6]
    ```
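
    The requirement check described above can be illustrated with a small toy
    model. This is only a sketch and not Spark's `EnsureRequirements` code:
    `ToyPlanner`, its node classes, and `ensureSorted` are hypothetical names
    made up for this example, and orderings are simplified to lists of column
    names. The idea is that every node declares the ordering it requires from
    each child, and a `Sort` is inserted wherever a child does not already
    provide it.

    ```scala
    // Toy model only: these types are hypothetical and are NOT Spark's classes.
    object ToyPlanner {
      sealed trait Plan {
        def children: Seq[Plan]
        // Columns by which this node's output is known to be sorted.
        def outputOrdering: Seq[String]
        // Ordering required from each child; an empty list means "no requirement".
        def requiredChildOrdering: Seq[Seq[String]] = children.map(_ => Seq.empty)
        def withChildren(newChildren: Seq[Plan]): Plan
      }

      final case class Scan(table: String) extends Plan {
        def children: Seq[Plan] = Nil
        def outputOrdering: Seq[String] = Nil
        def withChildren(newChildren: Seq[Plan]): Plan = this
      }

      final case class Sort(keys: Seq[String], child: Plan) extends Plan {
        def children: Seq[Plan] = Seq(child)
        def outputOrdering: Seq[String] = keys
        def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
      }

      // A shuffle is modelled as destroying any ordering of its input.
      final case class Exchange(keys: Seq[String], child: Plan) extends Plan {
        def children: Seq[Plan] = Seq(child)
        def outputOrdering: Seq[String] = Nil
        def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
      }

      // Both the partial and the final sort-based aggregate need sorted input.
      final case class SortAggregate(keys: Seq[String], partial: Boolean, child: Plan)
          extends Plan {
        def children: Seq[Plan] = Seq(child)
        def outputOrdering: Seq[String] = keys
        override def requiredChildOrdering: Seq[Seq[String]] = Seq(keys)
        def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
      }

      // Bottom-up pass: fix the children first, then add a Sort above any child
      // whose output ordering does not start with the required ordering.
      def ensureSorted(plan: Plan): Plan = {
        val fixedChildren = plan.children.map(ensureSorted)
        val sortedChildren = fixedChildren.zip(plan.requiredChildOrdering).map {
          case (child, required) =>
            if (required.isEmpty || child.outputOrdering.startsWith(required)) child
            else Sort(required, child)
        }
        plan.withChildren(sortedChildren)
      }
    }
    ```

    In this model, passing the shape of the broken plan (final `SortAggregate`
    over `Exchange` over partial `SortAggregate` over `Scan`) to `ensureSorted`
    adds a `Sort` below both aggregates, which mirrors the shape of the correct
    plan shown above.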
    
    ## How was this patch tested?
    Added tests in `PlannerSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-17289

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14865.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14865
    
----
commit a0cc9866a6494d5e4dd663e70e38158a3b9a45c7
Author: Takeshi YAMAMURO <linguin....@gmail.com>
Date:   2016-08-29T16:12:26Z

    Fix bug in partial aggregations

----


