alamb commented on issue #9370:
URL: 
https://github.com/apache/arrow-datafusion/issues/9370#issuecomment-2003556263

   @mustafasrepo  also had some good thoughts from discord which I am copying 
here 
https://discord.com/channels/885562378132000778/1216380737410826270/1216684364805570560
   
   "
   
   I think, best approach would be Refactoring RepartitionExec such that it can 
parallelize hashing, when its input partition number is 1.
   
   The point of
   ``` 
   RepartitionExec(hash n_in=8, n_out=8)
   --RepartitionExec(round robin n_in=1, n_out=8)
   ```
   is to parallelizes hashing. Otherwise it is functionally equivalent to 
   
   ```
   RepartitionExec(hash n_in=1, n_out=8)
   ```
   
   For this reason, if we can parallelize hashing algorithm when input 
partition number is 1. That would enable us to use plan above without worry.
   I think, we primarily need to change implementation of `RepartitionExec`. 
Once `RepartitionExec` has this capacity,  we need to update 
EnforceDistribution rule to generate plan `Repartition(Hash n_in=1, n_out=8)` 
instead of `Repartition(hash n_in=8, n_out=8`), `Repartition(round robin 
n_in=1, n_out=8)` stack. However, this second stage is quite trivial. I am 
familiar with that part of the code.
   
   "


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to