alamb commented on issue #9370: URL: https://github.com/apache/arrow-datafusion/issues/9370#issuecomment-2003556263
@mustafasrepo also had some good thoughts from discord which I am copying here https://discord.com/channels/885562378132000778/1216380737410826270/1216684364805570560 " I think, best approach would be Refactoring RepartitionExec such that it can parallelize hashing, when its input partition number is 1. The point of ``` RepartitionExec(hash n_in=8, n_out=8) --RepartitionExec(round robin n_in=1, n_out=8) ``` is to parallelizes hashing. Otherwise it is functionally equivalent to ``` RepartitionExec(hash n_in=1, n_out=8) ``` For this reason, if we can parallelize hashing algorithm when input partition number is 1. That would enable us to use plan above without worry. I think, we primarily need to change implementation of `RepartitionExec`. Once `RepartitionExec` has this capacity, we need to update EnforceDistribution rule to generate plan `Repartition(Hash n_in=1, n_out=8)` instead of `Repartition(hash n_in=8, n_out=8`), `Repartition(round robin n_in=1, n_out=8)` stack. However, this second stage is quite trivial. I am familiar with that part of the code. " -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
