c21 commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683245934


   > did you observe any patterns or heuristics on your workloads where 
repartition is preferred?
   
   Honestly, on our side we don't currently have any automation for deciding 
between coalesce and repartition. We provide configs similar to the ones here 
so that users can control coalesce vs repartition themselves.
   
   I think a rule of thumb is to avoid:
   (1) coalesce: if the coalesced table is too big and the # of coalesced 
buckets is too small, each task reads too much data and takes longer.
   (2) repartition: if the repartitioned table is too big and the # of 
repartitioned buckets is too large, too much duplicated data is read, which 
adds too much CPU/IO cost (possibly worse than just shuffling the table).
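
   The rule of thumb above could be sketched roughly as follows. This is only 
an illustration, not what the PR or Spark implements: `BucketedTable`, 
`choose_strategy`, and both thresholds are hypothetical names and values, and 
the cost model (each bucket file being re-read once per extra target bucket) 
is an assumption.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass
class BucketedTable:
    size_bytes: int   # total size of the table on disk
    num_buckets: int  # number of buckets the table was written with


def choose_strategy(a, b, max_bytes_per_task=1 * GIB, max_extra_read_bytes=4 * GIB):
    """Pick "none", "coalesce", "repartition" or "shuffle" for a bucketed join.

    Both thresholds are made-up defaults for illustration only.
    """
    if a.num_buckets == b.num_buckets:
        return "none"  # bucket counts already match, nothing to adjust

    fewer, more = sorted((a, b), key=lambda t: t.num_buckets)

    # (1) coalesce: read the table with more buckets as if it had fewer.
    # Each task then handles roughly size / (smaller bucket count) bytes.
    bytes_per_task = more.size_bytes / fewer.num_buckets
    if bytes_per_task <= max_bytes_per_task:
        return "coalesce"

    # (2) repartition: read the table with fewer buckets as if it had more.
    # Assumption: every bucket file is scanned once per target bucket it
    # maps to, so the duplicated read grows with the bucket ratio.
    ratio = more.num_buckets / fewer.num_buckets
    extra_read_bytes = fewer.size_bytes * (ratio - 1)
    if extra_read_bytes <= max_extra_read_bytes:
        return "repartition"

    # Neither option looks cheap; fall back to a regular shuffle.
    return "shuffle"
```

   For example, joining an 8 GiB table with 8 buckets against a 1 GiB table 
with 4 buckets would skip coalesce under these thresholds (2 GiB per task) and 
pick repartition (1 GiB of duplicated reads).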




