kmitchener commented on issue #3629:
URL:
https://github.com/apache/arrow-datafusion/issues/3629#issuecomment-1259971237
In this particular case of the tpch binary, if you leave the default
partitions setting at 1, the physical plan is _just_ the `CsvExec` node and
the output is a single file. (The input is a single CSV file, so there is only
one output partition, and `write_parquet()` writes one file per partition.)
Based on this usage of `DataFrame.repartition()`, and Andy's suggestion in the
Slack channel to use `repartition()` in this same way, I think round-robin
splitting any input into N partitions is the intended use of `repartition()`.
It just doesn't work in practice, because the physical optimizer adds
`RepartitionExec` nodes after it that undo the work:
```
RepartitionExec: partitioning=RoundRobinBatch(20)
RepartitionExec: partitioning=RoundRobinBatch(4)
RepartitionExec: partitioning=RoundRobinBatch(20)
```
I was thinking of fixing it by adding some logic to the repartitioning rule of
the physical optimizer so that it won't add two `RepartitionExec` nodes in a row.
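To illustrate the proposed guard, here is a minimal sketch using a toy plan representation. This is not the actual DataFusion optimizer code (the real rule operates on `Arc<dyn ExecutionPlan>`); the `Plan` enum and `maybe_repartition` helper are hypothetical names used only to show the "skip if the child is already a repartition" check:

```rust
// Toy stand-in for a physical plan tree (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    CsvExec { partitions: usize },
    RepartitionExec { round_robin: usize, input: Box<Plan> },
}

/// Wrap `plan` in a round-robin RepartitionExec targeting `n` partitions,
/// unless it is already a round-robin repartition. In that case, leave the
/// existing (e.g. user-requested) partitioning alone instead of stacking a
/// second RepartitionExec on top that undoes it.
fn maybe_repartition(plan: Plan, n: usize) -> Plan {
    match plan {
        // Child is already a repartition: don't add another one in a row.
        Plan::RepartitionExec { .. } => plan,
        other => Plan::RepartitionExec {
            round_robin: n,
            input: Box::new(other),
        },
    }
}

fn main() {
    let scan = Plan::CsvExec { partitions: 1 };

    // The user asked for 20 partitions via DataFrame.repartition().
    let user = maybe_repartition(scan, 20);

    // The optimizer later tries to repartition to its own target (say 4);
    // with the guard in place, the user's plan is left unchanged.
    let optimized = maybe_repartition(user.clone(), 4);
    assert_eq!(user, optimized);
    println!("{:?}", optimized);
}
```

With this check, the optimizer would only insert a `RepartitionExec` when the child isn't one already, so a user-requested partitioning survives optimization.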