aokolnychyi edited a comment on pull request #29089:
URL: https://github.com/apache/spark/pull/29089#issuecomment-658839477
Oops, thanks for catching the corner case quickly, @dongjoon-hyun!
My original idea for this PR was based on the fact that range partitioning
followed by a local sort is equivalent to a global sort if the expressions are
compatible. Then I started to generalize this idea and there was no obvious
corner case. While this one is very subtle, I think it makes sense if we think
more about it. Repartition nodes change data distribution but may not
necessarily change the ordering of data (at least, there may be sorted chunks).
This is partly why we excluded coalesce in the original
proposal. Based on the example above, this seems to be true even if we hash
partition our data.
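To make the two observations above concrete, here is a minimal sketch using plain Python lists rather than Spark. The helpers `range_partition` and `hash_partition` are hypothetical stand-ins, and the "hash" is just the value modulo the partition count. The model also assumes a single upstream partition, so it only loosely mirrors what a real shuffle does.

```python
from bisect import bisect_right

def range_partition(rows, bounds):
    """Hypothetical helper: route each value to the partition whose range
    contains it. `bounds` are sorted split points between partitions."""
    parts = [[] for _ in range(len(bounds) + 1)]
    for r in rows:
        parts[bisect_right(bounds, r)].append(r)
    return parts

def hash_partition(rows, n):
    """Hypothetical helper: toy hash partitioning (value modulo n).
    Rows are appended in arrival order, so input order is preserved
    within each partition."""
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[r % n].append(r)
    return parts

rows = [7, 2, 9, 4, 1, 8, 3, 6, 5, 0]

# Range partitioning followed by a local sort: concatenating the
# partitions in order yields the same result as a global sort.
parts = range_partition(rows, bounds=[3, 6])
concatenated = [x for p in parts for x in sorted(p)]
range_then_local_equals_global = concatenated == sorted(rows)

# Hash partitioning already-sorted input: the distribution changes,
# but each partition is still a sorted chunk.
hashed = hash_partition(sorted(rows), 3)
chunks_still_sorted = all(p == sorted(p) for p in hashed)

print(range_then_local_equals_global, chunks_still_sorted)
```

In this toy model the repartition changes where rows live but leaves sorted runs behind, which is the kind of leftover ordering the corner case relies on.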
I'd explore cases where sort+repartition are next to each other. In that
case, we are sure we change both the ordering and distribution and can
potentially ignore the sort below.
For example, we may have this:
```
sql("select * from (select * from (select * from t order by b desc) distribute by a) order by b asc")
```
```
Sort [b#6 ASC NULLS FIRST], true
+- RepartitionByExpression [a#5], 4
   +- Sort [b#6 DESC NULLS LAST], true
      +- Repartition 2, true
         +- LocalRelation [a#5, b#6]
```
Is there a case where we want to keep the first sort before
RepartitionByExpression?
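As a rough way to think about the question, the plan can be mimicked with plain Python lists. This is only a toy model under stated assumptions: `distribute_by` and `global_sort` are hypothetical helpers, the table contents are made up, and nothing here reflects how Spark's optimizer should treat the plan.

```python
def distribute_by(rows, key, n):
    """Hypothetical stand-in for RepartitionByExpression: toy hash
    partitioning on key(row) modulo n, preserving arrival order."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[key(row) % n].append(row)
    return parts

def global_sort(parts, key):
    """Hypothetical stand-in for a global Sort: gather every partition
    and sort the whole dataset ascending by key."""
    return sorted((row for p in parts for row in p), key=key)

# Made-up contents for table t as (a, b) tuples.
t = [(1, 30), (2, 10), (3, 20), (4, 10), (5, 40)]

def a(row):
    return row[0]

def b(row):
    return row[1]

# Plan as written: inner sort by b desc, distribute by a, outer sort by b asc.
with_inner = global_sort(distribute_by(sorted(t, key=b, reverse=True), a, 4), b)

# Same plan with the inner sort dropped.
without_inner = global_sort(distribute_by(t, a, 4), b)

same_b_order = [b(r) for r in with_inner] == [b(r) for r in without_inner]
same_rows = sorted(with_inner) == sorted(without_inner)

print(same_b_order, same_rows)
```

The comparison is on the `b` column and on the set of rows, since the outer sort by `b` does not promise any particular order among rows with equal `b`.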