Re: [PR] [SPARK-39911][SQL] Optimize global Sort to RepartitionByExpression [spark]

via GitHub Fri, 01 Dec 2023 12:52:54 -0800


maytasm commented on PR #37330:
URL: https://github.com/apache/spark/pull/37330#issuecomment-1836761095


   @ulysses-you @cloud-fan 
   Not sure if I am missing something, but could this cause a performance 
degradation if my sort order is on a key with only one or a few distinct values?
   For example, if I have 500 shuffle partitions...
   Without this change:
   ```
   Sort local    
     Sort global   
   ```
   Both of the above stages would run with 500 tasks.
   With this change, say the RepartitionByExpression is on the column `date` 
and there is only a single value for it in my dataset:
   ```
   Sort local 
      RepartitionByExpression
   ```
   RepartitionByExpression would run with 500 tasks and put every row into a 
single partition, so the Sort would then effectively run as a single task (as 
only one partition holds data).
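
To make the concern concrete, here is a minimal sketch in plain Python (no Spark involved) of how hash partitioning distributes rows when the key has a single distinct value. The function names are hypothetical, and CRC32 is only a stand-in: Spark's hash partitioning uses a different hash function, so the exact partition ids would differ, but the effect for a single-valued key should be the same.

```python
import zlib
from collections import Counter

NUM_SHUFFLE_PARTITIONS = 500  # the 500 shuffle partitions from the example above


def hash_partition(key: str, num_partitions: int) -> int:
    # Stand-in for hash partitioning: CRC32 is an assumption made for this
    # sketch, not the hash function Spark itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def non_empty_partitions(keys, num_partitions=NUM_SHUFFLE_PARTITIONS) -> int:
    # Count how many target partitions receive at least one row.
    counts = Counter(hash_partition(k, num_partitions) for k in keys)
    return len(counts)


# A dataset whose `date` column holds one distinct value.
single_value = ["2023-12-01"] * 10_000

# A dataset whose `date` column holds many distinct values.
many_values = [f"2023-{m:02d}-{d:02d}" for m in range(1, 13) for d in range(1, 29)]

print(non_empty_partitions(single_value))  # 1: every row hashes to the same partition
print(non_empty_partitions(many_values))   # at most 336, one per distinct key
```

With a single distinct key, all rows land in one partition regardless of the partition count, so whatever runs after the repartition has only one partition's worth of parallel work.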


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

