maytasm commented on PR #37330:
URL: https://github.com/apache/spark/pull/37330#issuecomment-1836761095
@ulysses-you @cloud-fan
Not sure if I am missing something, but can this cause performance
degradation if my sort order is on a key with few values, or only one?
For example, if I have 500 shuffle partitions...
Without this change:
```
Sort local
Sort global
```
Both of the above stages would run with 500 tasks.
With this change, say the RepartitionByExpression is on the column date
and there is only a single value for it in my dataset:
```
Sort local
RepartitionByExpression
```
RepartitionByExpression will run with 500 tasks and create a single
partition; the Sort would then run with a single task (as there is only
a single partition).
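
To make the concern concrete, here is a rough sketch in plain Python (not Spark code) of how hash-partitioning rows on a key with a single distinct value leaves only one partition holding data. The helper names `partition_id` and `rows_per_partition` are made up for illustration, and Python's built-in `hash()` is only a stand-in for the hash function Spark actually uses; the sketch models key-to-partition assignment only, not the planner.

```python
from collections import Counter

NUM_SHUFFLE_PARTITIONS = 500  # the 500 shuffle partitions from the example


def partition_id(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    # Hypothetical stand-in for hash partitioning: equal keys always
    # land in the same partition, whatever the hash function is.
    return hash(key) % num_partitions


def rows_per_partition(keys, num_partitions=NUM_SHUFFLE_PARTITIONS):
    # Count how many rows each partition id receives.
    return Counter(partition_id(k, num_partitions) for k in keys)


# A dataset where the date column has a single value.
single_date = ["2023-12-01"] * 10_000
# A dataset with many distinct dates, for comparison.
many_dates = [f"2023-{m:02d}-{d:02d}" for m in range(1, 13) for d in range(1, 29)]

single = rows_per_partition(single_date)
many = rows_per_partition(many_dates)

print("partitions holding data (single date):", len(single))
print("partitions holding data (many dates):", len(many))
```

With a single date, every row maps to the same partition id, so any per-partition work after the repartition (such as the Sort) has data in only one task, which is the scenario I am asking about.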