houqp commented on pull request #1143: URL: https://github.com/apache/arrow-datafusion/pull/1143#issuecomment-946372811
I think the main difference is that coalesce doesn't perform any shuffle, while repartition may perform one depending on the partitioning scheme. This distinction comes from Spark's DataFrame API: https://www.hadoopinrealworld.com/what-is-the-difference-between-repartition-and-coalesce-in-spark/. In theory, we could implement coalesce using RepartitionExec at the physical layer by adding a new partitioning scheme, but it might complicate the code there and result in slightly more overhead for both operations. That said, if we can come up with a zero-overhead and clean implementation of CoalescePartitionsExec within RepartitionExec, then I am 100% on board with merging them :)
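
To make the shuffle vs. no-shuffle distinction concrete, here is a rough toy model in Python. It is not DataFusion's or Spark's implementation: partitions are plain lists of integers, and the function names `coalesce` and `repartition_hash`, the round-robin grouping, and the "value modulo target" hash are all assumptions made up for illustration.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], target: int) -> List[Partition]:
    """Reduce the partition count by merging whole input partitions.

    No row is inspected: each input partition is appended wholesale to one
    output partition, so nothing is redistributed by key (no shuffle).
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    # Coalescing can only reduce the partition count, never increase it.
    target = max(1, min(target, len(partitions)))
    out: List[Partition] = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        out[i % target].extend(part)  # hypothetical round-robin grouping
    return out


def repartition_hash(partitions: List[Partition], target: int) -> List[Partition]:
    """Route every row individually to an output partition chosen by its key.

    This is the shuffle: rows from one input partition can end up in any
    output partition.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    out: List[Partition] = [[] for _ in range(target)]
    for part in partitions:
        for row in part:
            out[row % target].append(row)  # toy hash: value modulo target
    return out


if __name__ == "__main__":
    parts = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(coalesce(parts, 2))
    print(repartition_hash(parts, 2))
```

In this sketch, `coalesce` keeps each input partition intact inside a single output partition, while `repartition_hash` touches every row. That per-row routing is the extra work a shared implementation would need to avoid on the coalesce path.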
