kmitchener commented on issue #3629: URL: https://github.com/apache/arrow-datafusion/issues/3629#issuecomment-1260031829
So the use case I'm thinking of is something like this:

* get a single large CSV file, read it in
* run some transformations on it (simple cleaning/trim/reformatting of timestamps/adding other columns/data type casting, etc)
* write output to parquet files, where I can control the number of files being created

In this case, we want to leave `target_partitions` at its default so it uses all cores for processing the transformations, but I want to control the number of parquet files being written. If I used a `df.repartition()` right before `df.write_parquet()`, it would work fine in this case, if the optimizer wasn't messing with it. (I actually can't think of any other use case for `repartition()` -- if it's not for this, what is it for?)
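For concreteness, here's a minimal sketch of that pipeline using the DataFrame API roughly as it existed around the time of this issue. The `Partitioning` import path and the `write_parquet` signature vary across DataFusion versions, and `input.csv`, `out/`, and the `value` column are hypothetical stand-ins:

```rust
use datafusion::logical_expr::Partitioning;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read the single large CSV; target_partitions stays at its default,
    // so the transformations below still fan out across all cores.
    let df = ctx.read_csv("input.csv", CsvReadOptions::new()).await?;

    // Stand-in for the cleaning/trim/casting transformations
    // ("value" is a hypothetical column).
    let df = df.filter(col("value").is_not_null())?;

    // Collapse to a fixed number of partitions right before writing, so
    // write_parquet emits exactly that many files -- this is the
    // repartition() the optimizer is currently undoing.
    let df = df.repartition(Partitioning::RoundRobinBatch(1))?;

    // Writes one parquet file per partition of the final plan.
    df.write_parquet("out/", None).await?;
    Ok(())
}
```

The point of the sketch is the placement: if the physical optimizer re-inserts a repartition after the explicit `repartition(Partitioning::RoundRobinBatch(1))`, the write step sees more partitions than requested and produces more files than one.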
