kmitchener opened a new issue, #3629:
URL: https://github.com/apache/arrow-datafusion/issues/3629
**Describe the bug**
The DataFrame `repartition()` function doesn't work as expected: the requested partition count is not honored.
**To Reproduce**
Using the tpch bin from benchmarks, convert the .tbl (csv) files to Parquet
format using the "partitions" option:
```
cargo run --release --bin tpch -- convert --input ./tpch-data --output ./tpch-data-parquet --format parquet --partitions 4
```
That should have produced 4 Parquet files per table, but it instead created 20
(this laptop has 20 cores).
**Expected behavior**
Expected it to produce 4 Parquet files per table, one per requested partition.
**Additional context**
I added some debug output, and the physical plan produced is:
```
CoalesceBatchesExec: target_batch_size=4096
  RepartitionExec: partitioning=RoundRobinBatch(20)
    RepartitionExec: partitioning=RoundRobinBatch(4)
      RepartitionExec: partitioning=RoundRobinBatch(20)
        CsvExec: files=[home/kmitchener/dev/arrow-datafusion/benchmarks/tpch-data/region.tbl], has_header=false, limit=None, projection=[r_regionkey, r_name, r_comment]
```
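For anyone trying to reproduce this outside the benchmark, a similar plan can be printed with a few lines of the DataFrame API. This is only a minimal sketch: the path, delimiter, and header options are my assumptions about the TPC-H `.tbl` layout, not taken from the benchmark code, and the exact `DataFrame` signatures vary a bit between DataFusion versions.

```rust
use datafusion::physical_plan::Partitioning;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read one TPC-H .tbl file as '|'-delimited CSV without a header
    // (path and read options are placeholders, not from the benchmark).
    let df = ctx
        .read_csv(
            "./tpch-data/region.tbl",
            CsvReadOptions::new()
                .has_header(false)
                .delimiter(b'|')
                .file_extension(".tbl"),
        )
        .await?;

    // Ask for 4 partitions, mirroring `--partitions 4`.
    let df = df.repartition(Partitioning::RoundRobinBatch(4))?;

    // Print the plans; on a many-core machine the physical plan shows extra
    // RoundRobinBatch repartitions wrapped around the requested one.
    df.explain(false, false)?.show().await?;
    Ok(())
}
```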
The tpch bin repartitions the file using this bit of code:
```rust
// optionally, repartition the file
if opt.partitions > 1 {
    csv = csv.repartition(Partitioning::RoundRobinBatch(opt.partitions))?
}
```
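A quick way to confirm where the requested count gets lost is to check the planned output partitioning directly. The sketch below is hypothetical (the `check_partitions` helper and its options are mine, and API details differ across DataFusion versions), but given the plan above it would presumably report 20 rather than 4, matching the 20 files written, since the number of output files follows the plan's output partition count.

```rust
use datafusion::error::Result;
use datafusion::physical_plan::Partitioning;
use datafusion::prelude::*;

// Hypothetical helper: plan a repartitioned scan and report how many output
// partitions the physical plan actually produces.
async fn check_partitions(ctx: &SessionContext, path: &str, partitions: usize) -> Result<()> {
    let df = ctx
        .read_csv(
            path,
            CsvReadOptions::new()
                .has_header(false)
                .delimiter(b'|')
                .file_extension(".tbl"),
        )
        .await?;
    let df = df.repartition(Partitioning::RoundRobinBatch(partitions))?;

    let plan = df.create_physical_plan().await?;
    println!(
        "requested {} partitions, physical plan produces {}",
        partitions,
        plan.output_partitioning().partition_count()
    );
    Ok(())
}
```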