kmitchener opened a new issue, #3629:
URL: https://github.com/apache/arrow-datafusion/issues/3629
**Describe the bug**
The DataFrame `repartition()` function doesn't work as expected: the requested partition count is not honored.
**To Reproduce**
Using the tpch bin from benchmarks, convert the .tbl (csv) files to Parquet
format using the "partitions" option:
```
cargo run --release --bin tpch -- convert --input ./tpch-data --output ./tpch-data-parquet --format parquet --partitions 4
```
That should have produced 4 Parquet files per table, but it instead created 20
(this laptop has 20 cores).
**Expected behavior**
Expected it to produce 4 Parquet files per table, one per requested partition.
**Additional context**
I added some debug output, and the physical plan produced is:
```
CoalesceBatchesExec: target_batch_size=4096
  RepartitionExec: partitioning=RoundRobinBatch(20)
    RepartitionExec: partitioning=RoundRobinBatch(4)
      RepartitionExec: partitioning=RoundRobinBatch(20)
        CsvExec: files=[home/kmitchener/dev/arrow-datafusion/benchmarks/tpch-data/region.tbl], has_header=false, limit=None, projection=[r_regionkey, r_name, r_comment]
```
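For anyone trying to reproduce this outside the benchmark, a similar plan can be printed with a few lines of the DataFrame API. This is only a minimal sketch: the path, delimiter, and header options are my assumptions about the TPC-H `.tbl` layout, not taken from the benchmark code, and the exact `DataFrame` signatures vary a bit between DataFusion versions.

```rust
use datafusion::physical_plan::Partitioning;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read one TPC-H .tbl file as '|'-delimited CSV without a header
    // (path and read options are placeholders, not from the benchmark).
    let df = ctx
        .read_csv(
            "./tpch-data/region.tbl",
            CsvReadOptions::new()
                .has_header(false)
                .delimiter(b'|')
                .file_extension(".tbl"),
        )
        .await?;

    // Ask for 4 partitions, mirroring `--partitions 4`.
    let df = df.repartition(Partitioning::RoundRobinBatch(4))?;

    // Print the plans; on a many-core machine the physical plan shows extra
    // RoundRobinBatch repartitions wrapped around the requested one.
    df.explain(false, false)?.show().await?;
    Ok(())
}
```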
The tpch bin repartitions the file using this bit of code:
```rust
// optionally, repartition the file
if opt.partitions > 1 {
    csv = csv.repartition(Partitioning::RoundRobinBatch(opt.partitions))?
}
```
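A quick way to confirm where the requested count gets lost is to check the planned output partitioning directly. The sketch below is hypothetical (the `check_partitions` helper and its options are mine, and API details differ across DataFusion versions), but given the plan above it would presumably report 20 rather than 4, matching the 20 files written, since the number of output files follows the plan's output partition count.

```rust
use datafusion::error::Result;
use datafusion::physical_plan::Partitioning;
use datafusion::prelude::*;

// Hypothetical helper: plan a repartitioned scan and report how many output
// partitions the physical plan actually produces.
async fn check_partitions(ctx: &SessionContext, path: &str, partitions: usize) -> Result<()> {
    let df = ctx
        .read_csv(
            path,
            CsvReadOptions::new()
                .has_header(false)
                .delimiter(b'|')
                .file_extension(".tbl"),
        )
        .await?;
    let df = df.repartition(Partitioning::RoundRobinBatch(partitions))?;

    let plan = df.create_physical_plan().await?;
    println!(
        "requested {} partitions, physical plan produces {}",
        partitions,
        plan.output_partitioning().partition_count()
    );
    Ok(())
}
```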