Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21291#discussion_r188659776

--- Diff: python/pyspark/sql/tests.py ---
@@ -5239,8 +5239,8 @@ def test_complex_groupby(self):
         expected2 = df.groupby().agg(sum(df.v))

         # groupby one column and one sql expression
-        result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
-        expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v))
+        result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v)).orderBy(df.id, df.v % 2)
--- End diff ---

Simply put, the data ordering between `result3` and `expected3` is different now.

The previous query plans for the two queries:

```
== Physical Plan ==
!AggregateInPandas [id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#40], [sum(v#8)], [id#0L, (v#8 % 2.0)#40 AS (v % 2)#22, sum(v)#21 AS sum(v)#23]
+- *(2) Sort [id#0L ASC NULLS FIRST, (v#8 % 2.0) AS (v#8 % 2.0)#40 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#40, 200)
      +- Generate explode(vs#4), [id#0L], false, [v#8]
         +- *(1) Project [id#0L, array((20.0 + cast(id#0L as double)), (21.0 + cast(id#0L as double)), (22.0 + cast(id#0L as double)), (23.0 + cast(id#0L as double)), (24.0 + cast(id#0L as double)), (25.0 + cast(id#0L as double)), (26.0 + cast(id#0L as double)), (27.0 + cast(id#0L as double)), (28.0 + cast(id#0L as double)), (29.0 + cast(id#0L as double))) AS vs#4]
            +- *(1) Range (0, 10, step=1, splits=8)
```

```
== Physical Plan ==
*(3) HashAggregate(keys=[id#0L, (v#8 % 2.0)#36], functions=[sum(v#8)], output=[id#0L, (v % 2)#31, sum(v)#32])
+- Exchange hashpartitioning(id#0L, (v#8 % 2.0)#36, 200)
   +- *(2) HashAggregate(keys=[id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#36], functions=[partial_sum(v#8)], output=[id#0L, (v#8 % 2.0)#36, sum#38])
      +- Generate explode(vs#4), [id#0L], false, [v#8]
         +- *(1) Project [id#0L, array((20.0 + cast(id#0L as double)), (21.0 + cast(id#0L as double)), (22.0 + cast(id#0L as double)), (23.0 + cast(id#0L as double)), (24.0 + cast(id#0L as double)), (25.0 + cast(id#0L as double)), (26.0 + cast(id#0L as double)), (27.0 + cast(id#0L as double)), (28.0 + cast(id#0L as double)), (29.0 + cast(id#0L as double))) AS vs#4]
            +- *(1) Range (0, 10, step=1, splits=8)
```

Both plans contain an `Exchange hashpartitioning`, so the two queries previously produced the same data distribution. Note that the `Sort` didn't change the data ordering there, because with 200 shuffle partitions the rows are spread so sparsely that each partition has little or nothing to reorder.
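For reference, here is a minimal repro sketch of my own (not the test code itself, though the data mirrors the `Project` node in the plans above); it assumes a local `SparkSession` with pyarrow installed and grouped-agg pandas UDF support:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import (array, col, explode, lit, pandas_udf,
                                   sum, PandasUDFType)

spark = SparkSession.builder.master("local[4]").getOrCreate()

# ids 0..9, each exploded into ten doubles 20.0+id .. 29.0+id,
# matching the Project node shown in the plans above.
df = (spark.range(10).toDF("id")
      .withColumn("vs", array([lit(float(i)) + col("id") for i in range(20, 30)]))
      .withColumn("v", explode(col("vs"))))

# Grouped-aggregate pandas UDF equivalent of the built-in sum.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
    return v.sum()

result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v))

result3.explain()    # AggregateInPandas + Sort path
expected3.explain()  # two-stage HashAggregate path
```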
The current query plans:

```
== Physical Plan ==
!AggregateInPandas [id#388L, (v#396 % 2.0) AS (v#396 % 2.0)#453], [sum(v#396)], [id#388L, (v#396 % 2.0)#453 AS (v % 2)#438, sum(v)#437 AS sum(v)#439]
+- *(2) Sort [id#388L ASC NULLS FIRST, (v#396 % 2.0) AS (v#396 % 2.0)#453 ASC NULLS FIRST], false, 0
   +- Generate explode(vs#392), [id#388L], false, [v#396]
      +- *(1) Project [id#388L, array((20.0 + cast(id#388L as double)), (21.0 + cast(id#388L as double)), (22.0 + cast(id#388L as double)), (23.0 + cast(id#388L as double)), (24.0 + cast(id#388L as double)), (25.0 + cast(id#388L as double)), (26.0 + cast(id#388L as double)), (27.0 + cast(id#388L as double)), (28.0 + cast(id#388L as double)), (29.0 + cast(id#388L as double))) AS vs#392]
         +- *(1) Range (0, 10, step=1, splits=4)
```

```
== Physical Plan ==
*(2) HashAggregate(keys=[id#388L, (v#396 % 2.0)#454], functions=[sum(v#396)], output=[id#388L, (v % 2)#447, sum(v)#448])
+- *(2) HashAggregate(keys=[id#388L, (v#396 % 2.0) AS (v#396 % 2.0)#454], functions=[partial_sum(v#396)], output=[id#388L, (v#396 % 2.0)#454, sum#456])
   +- Generate explode(vs#392), [id#388L], false, [v#396]
      +- *(1) Project [id#388L, array((20.0 + cast(id#388L as double)), (21.0 + cast(id#388L as double)), (22.0 + cast(id#388L as double)), (23.0 + cast(id#388L as double)), (24.0 + cast(id#388L as double)), (25.0 + cast(id#388L as double)), (26.0 + cast(id#388L as double)), (27.0 + cast(id#388L as double)), (28.0 + cast(id#388L as double)), (29.0 + cast(id#388L as double))) AS vs#392]
         +- *(1) Range (0, 10, step=1, splits=4)
```

The `Exchange` is not there anymore, so the two queries still have the same data distribution, but now the `Sort` does change the data ordering.
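Since the truncated hunk above only shows the new `result3` line, the snippet below is a hedged sketch of my own (continuing the repro sketch earlier; the `sum_v` alias and the pandas comparison are illustrative, not the PR's actual change) of how ordering both sides keeps the assertion stable:

```python
import pandas as pd

# Continuing the earlier sketch (df, sum_udf, and pyspark's `sum` in scope).
# Sorting both sides on the grouping keys makes the row order deterministic,
# whether or not the planner inserts an Exchange.
result3 = (df.groupby(df.id, df.v % 2)
             .agg(sum_udf(df.v).alias("sum_v"))
             .orderBy(df.id, df.v % 2))
expected3 = (df.groupby(df.id, df.v % 2)
               .agg(sum(df.v).alias("sum_v"))
               .orderBy(df.id, df.v % 2))

# With identical ordering on both sides, a row-wise comparison no longer
# depends on the physical plan's data ordering.
pd.testing.assert_frame_equal(expected3.toPandas(), result3.toPandas())
```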