[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #1143: Add output_partitions_size for CoalescePartitionsExec

GitBox Mon, 18 Oct 2021 23:26:09 -0700


Dandandan edited a comment on pull request #1143:
URL: https://github.com/apache/arrow-datafusion/pull/1143#issuecomment-946377551



   In Spark, `repartition` is implemented by calling `coalesce` with the 
parameter `shuffle=true`.
   I think it might be cleaner to keep `RepartitionExec` and 
`CoalescePartitionsExec` separate; otherwise you get two implementations in 
the same code without much sharing between them.
   
   For implementing `CoalescePartitionsExec` we just need a scheme that 
combines partitions within `execute` (e.g. when reducing the number of 
partitions from 8 to 4 we can return partitions 0,1 for `execute(0)`, 2,3 for 
`execute(1)`, etc.).
   For Ballista, we have to know (explicitly or implicitly) which partitions 
are living on which node to avoid shuffles.
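
   A minimal sketch of the partition-combining scheme described above, just 
to make the index arithmetic concrete. `input_partitions_for` is a 
hypothetical helper, not an existing DataFusion function, and it assumes each 
output partition reads a contiguous range of input partitions:

```rust
use std::ops::Range;

/// Hypothetical helper (not part of DataFusion): returns the contiguous range
/// of input partitions that output partition `output_idx` would combine when
/// `num_inputs` partitions are coalesced into `num_outputs` partitions.
fn input_partitions_for(
    output_idx: usize,
    num_inputs: usize,
    num_outputs: usize,
) -> Range<usize> {
    assert!(num_outputs > 0, "need at least one output partition");
    assert!(output_idx < num_outputs, "output partition out of range");
    // Integer division spreads any remainder across the output partitions,
    // and consecutive ranges share a boundary, so every input partition is
    // assigned to exactly one output partition.
    let start = output_idx * num_inputs / num_outputs;
    let end = (output_idx + 1) * num_inputs / num_outputs;
    start..end
}
```

   With 8 inputs and 4 outputs, `input_partitions_for(0, 8, 4)` gives `0..2` 
and `input_partitions_for(1, 8, 4)` gives `2..4`, matching the example above. 
An `execute(i)` built on this would merge the streams of the input partitions 
in the returned range.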
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

