[GitHub] [spark] aokolnychyi edited a comment on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

GitBox Wed, 15 Jul 2020 09:12:29 -0700


aokolnychyi edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658839477



   Oops, thanks for catching the corner case quickly, @dongjoon-hyun!
   
   My original idea for this PR was based on the fact that range partitioning 
followed by a local sort is equivalent to a global sort when the expressions 
are compatible. I then generalized this idea and did not see an obvious corner 
case. Although this one is very subtle, it makes sense on reflection: 
repartition nodes change the data distribution but do not necessarily change 
the ordering of the data (at least, sorted chunks may survive). This is partly 
why we excluded coalesce in the original proposal. Based on the example above, 
this seems to hold even when we hash partition the data.
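   
   To make the "sorted chunks" point concrete, here is a minimal sketch in 
plain Scala collections. It is only a model, not Spark's shuffle: it assumes a 
single sorted input stream, and `Row`, `hashPartition`, and `isSortedDescByB` 
are hypothetical names introduced for illustration.
   
   ```scala
   // Plain-Scala model of a hash shuffle fed by a single sorted stream.
   // Illustration only; this is not how Spark implements its shuffle.
   object SortedChunksSketch {
     // Hypothetical row type mirroring columns a and b of table t.
     final case class Row(a: Int, b: Int)

     // Route each row by a hash of column a. Rows keep their arrival order
     // within a partition because groupBy appends in encounter order.
     def hashPartition(rows: Seq[Row], numPartitions: Int): Map[Int, Seq[Row]] =
       rows.groupBy(r => java.lang.Math.floorMod(r.a.hashCode, numPartitions))

     // True if column b never increases from one row to the next.
     def isSortedDescByB(rows: Seq[Row]): Boolean =
       rows.zip(rows.drop(1)).forall { case (prev, next) => prev.b >= next.b }

     def main(args: Array[String]): Unit = {
       // Input already ordered by b descending, as after `order by b desc`.
       val sortedInput = Seq(Row(1, 9), Row(2, 7), Row(1, 5), Row(3, 4), Row(2, 1))
       val partitions = hashPartition(sortedInput, numPartitions = 4)
       partitions.toSeq.sortBy(_._1).foreach { case (id, rows) =>
         println(s"partition $id sorted by b desc: ${isSortedDescByB(rows)}")
       }
     }
   }
   ```
   
   In this model every output partition is still ordered by `b`, which is the 
kind of leftover ordering a downstream operator could observe if the sort 
below a repartition were removed.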
   
   I'd explore cases where sort and repartition nodes are next to each other. 
In that case, we are sure that both the ordering and the distribution change, 
so we can potentially ignore the sort below the repartition. 
   
   For example, we may have this:
   
   ```
   sql("select * from (select * from (select * from t order by b desc) distribute by a) order by b asc")
   ```
   
   ```
   Sort [b#6 ASC NULLS FIRST], true
   +- RepartitionByExpression [a#5], 4
      +- Sort [b#6 DESC NULLS LAST], true
         +- Repartition 2, true
            +- LocalRelation [a#5, b#6]
   ```
   
   Is there a case where we would want to keep the first sort (the one below 
RepartitionByExpression)?
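   
   As a rough sketch of the pattern in question, here is a toy plan model in 
plain Scala. The case classes only mimic the node names printed above; they 
are hypothetical stand-ins, not Catalyst's classes, and a real rule would 
likely need extra checks that are not modeled here.
   
   ```scala
   // Toy logical plan; names mirror the printed plan but are NOT Catalyst classes.
   object RedundantSortSketch {
     sealed trait Plan
     final case class Relation(name: String) extends Plan
     final case class Repartition(numPartitions: Int, shuffle: Boolean, child: Plan) extends Plan
     final case class RepartitionByExpression(exprs: Seq[String], numPartitions: Int, child: Plan) extends Plan
     final case class Sort(order: String, global: Boolean, child: Plan) extends Plan

     // Drop a Sort sitting directly below a RepartitionByExpression, but only
     // when another Sort sits directly above that repartition node.
     def removeSortBelowRepartition(plan: Plan): Plan = plan match {
       case Sort(o, g, RepartitionByExpression(e, n, Sort(_, _, inner))) =>
         // Re-apply to the rewritten node in case `inner` is itself a Sort.
         removeSortBelowRepartition(Sort(o, g, RepartitionByExpression(e, n, inner)))
       case Sort(o, g, child) =>
         Sort(o, g, removeSortBelowRepartition(child))
       case RepartitionByExpression(e, n, child) =>
         RepartitionByExpression(e, n, removeSortBelowRepartition(child))
       case Repartition(n, s, child) =>
         Repartition(n, s, removeSortBelowRepartition(child))
       case r: Relation => r
     }

     // The plan from the example above, rebuilt with the toy nodes.
     val examplePlan: Plan =
       Sort("b ASC NULLS FIRST", global = true,
         RepartitionByExpression(Seq("a"), 4,
           Sort("b DESC NULLS LAST", global = true,
             Repartition(2, shuffle = true, Relation("t")))))

     def main(args: Array[String]): Unit =
       println(removeSortBelowRepartition(examplePlan))
   }
   ```
   
   A lone Sort below a repartition, with no Sort directly above it, is left 
untouched by this sketch, which matches the corner case discussed earlier.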


