c21 commented on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683245934
> did you observe any patterns or heuristics on your workloads where repartition is preferred?

From our side, honestly, we don't have any automation right now for deciding between coalesce and repartition. We provide configs similar to the ones here so that users can control coalesce vs. repartition themselves. I think a rule of thumb is that we don't want to:

(1) coalesce, if the coalesced table is too big and the number of coalesced buckets is too small: each task then has too much data and takes longer.
(2) repartition, if the repartitioned table is too big and the number of repartitioned buckets is too large: too much duplicated data is read, which costs much more CPU/IO (and might be worse than just shuffling this table).
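
To make the rule of thumb concrete, here is a minimal Python sketch of how such a decision could be expressed. It is not something Spark or this PR implements. The names `BucketedTable` and `choose_strategy`, both thresholds, the assumption of evenly sized buckets, and the "try coalesce first, then repartition, else shuffle" ordering are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class BucketedTable:
    """Hypothetical description of one side of a bucketed join."""
    size_bytes: int
    num_buckets: int


def choose_strategy(fewer: BucketedTable,
                    more: BucketedTable,
                    max_bytes_per_task: int = 1 * GIB,
                    max_extra_read_bytes: int = 8 * GIB) -> str:
    """Pick "coalesce", "repartition", or "shuffle" for a bucketed join.

    `fewer` is the side with fewer buckets, `more` the side with more.
    Both thresholds are illustrative assumptions, not Spark defaults.
    """
    if more.num_buckets % fewer.num_buckets != 0:
        raise ValueError("bucket counts must be multiples of each other")
    ratio = more.num_buckets // fewer.num_buckets

    # Coalesce: the `more` side is read with only `fewer.num_buckets` tasks,
    # so each task handles a larger slice (assuming evenly sized buckets).
    coalesced_bytes_per_task = more.size_bytes // fewer.num_buckets
    if coalesced_bytes_per_task <= max_bytes_per_task:
        return "coalesce"

    # Repartition: each bucket of the `fewer` side is read `ratio` times,
    # so (ratio - 1) copies of that table are duplicated reads.
    extra_read_bytes = fewer.size_bytes * (ratio - 1)
    if extra_read_bytes <= max_extra_read_bytes:
        return "repartition"

    # Both options are too costly under these thresholds: just shuffle.
    return "shuffle"
```

For example, under these assumed thresholds, `choose_strategy(BucketedTable(1 * GIB, 4), BucketedTable(100 * GIB, 16))` rejects coalescing (25 GiB per task) and picks repartition, since only 3 GiB of duplicated data would be read.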
