[GitHub] [arrow-datafusion] houqp commented on pull request #1143: Add output_partitions_size for CoalescePartitionsExec

GitBox Mon, 18 Oct 2021 22:13:09 -0700


houqp commented on pull request #1143:
URL: https://github.com/apache/arrow-datafusion/pull/1143#issuecomment-946372811



   I think the main difference is that coalesce never performs a shuffle, while 
repartition may, depending on the partitioning scheme. This distinction 
comes from Spark's DataFrame API: 
https://www.hadoopinrealworld.com/what-is-the-difference-between-repartition-and-coalesce-in-spark/.
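
   To make the distinction concrete, here is a rough toy sketch in plain Rust. It is not DataFusion code: `Partition`, `coalesce` and `repartition_hash` are hypothetical names made up for illustration, and a partition is modelled as a plain `Vec<i64>` rather than a stream of record batches. Coalesce just concatenates the inputs as they are, while the hash repartition has to look at every row to decide where it goes.

```rust
/// Toy model (assumption): a partition is just a vector of rows.
type Partition = Vec<i64>;

/// Coalesce: merge all input partitions into a single output partition.
/// Rows are concatenated in input order; no row is inspected or routed.
fn coalesce(input: Vec<Partition>) -> Vec<Partition> {
    vec![input.into_iter().flatten().collect()]
}

/// Hash-style repartition: every row is routed to an output partition
/// chosen from its value, so rows from one input can land anywhere.
/// `n` must be greater than zero.
fn repartition_hash(input: Vec<Partition>, n: usize) -> Vec<Partition> {
    let mut out: Vec<Partition> = vec![Vec::new(); n];
    for part in input {
        for row in part {
            // Stand-in for a real hash function: modulo on the row value.
            let idx = row.rem_euclid(n as i64) as usize;
            out[idx].push(row);
        }
    }
    out
}
```

   With this toy model, coalescing three partitions yields one partition with the rows in their original order, whereas repartitioning the same input into two partitions interleaves rows from all three inputs. That per-row routing is the extra work coalesce avoids.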
   
   In theory, we could implement coalesce using RepartitionExec at the physical 
layer by adding a new partitioning scheme, but it might complicate the code 
there and result in slightly more overhead for both operations. That said, 
if we can come up with a zero-overhead and clean implementation of 
CoalescePartitionsExec within RepartitionExec, then I am 100% on board with 
merging them :)

