kmitchener commented on issue #3629:
URL:
https://github.com/apache/arrow-datafusion/issues/3629#issuecomment-1259971237
In this particular case of the tpch binary, if you leave the default
partitions setting at 1, the physical plan is _just_ the `CsvExec` node and
the output is a single file. (The input is a single CSV file, so there is only
one output partition, and `write_parquet()` writes one file per partition.)
Based on this usage of `DataFrame.repartition()`, and Andy's suggestion in the
Slack channel to use `repartition()` in this same way, I think round-robin
splitting any input into N partitions is the intended use of `repartition()`.
It just doesn't work in practice, because the physical optimizer adds
`RepartitionExec` nodes after it that undo the work:
```
RepartitionExec: partitioning=RoundRobinBatch(20)
RepartitionExec: partitioning=RoundRobinBatch(4)
RepartitionExec: partitioning=RoundRobinBatch(20)
```
I was thinking of fixing it by adding some logic to the repartitioning rule of
the physical optimizer so that it won't add two `RepartitionExec` nodes in a row.
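To illustrate the proposed guard, here is a minimal sketch using a toy plan representation. This is not the actual DataFusion optimizer code (the real rule operates on `Arc<dyn ExecutionPlan>`); the `Plan` enum and `maybe_repartition` helper are hypothetical names used only to show the "skip if the child is already a repartition" check:

```rust
// Toy stand-in for a physical plan tree (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    CsvExec { partitions: usize },
    RepartitionExec { round_robin: usize, input: Box<Plan> },
}

/// Wrap `plan` in a round-robin RepartitionExec targeting `n` partitions,
/// unless it is already a round-robin repartition. In that case, leave the
/// existing (e.g. user-requested) partitioning alone instead of stacking a
/// second RepartitionExec on top that undoes it.
fn maybe_repartition(plan: Plan, n: usize) -> Plan {
    match plan {
        // Child is already a repartition: don't add another one in a row.
        Plan::RepartitionExec { .. } => plan,
        other => Plan::RepartitionExec {
            round_robin: n,
            input: Box::new(other),
        },
    }
}

fn main() {
    let scan = Plan::CsvExec { partitions: 1 };

    // The user asked for 20 partitions via DataFrame.repartition().
    let user = maybe_repartition(scan, 20);

    // The optimizer later tries to repartition to its own target (say 4);
    // with the guard in place, the user's plan is left unchanged.
    let optimized = maybe_repartition(user.clone(), 4);
    assert_eq!(user, optimized);
    println!("{:?}", optimized);
}
```

With this check, the optimizer would only insert a `RepartitionExec` when the child isn't one already, so a user-requested partitioning survives optimization.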