kmitchener commented on issue #3629: URL: https://github.com/apache/arrow-datafusion/issues/3629#issuecomment-1260031829
So the use case I'm thinking of is something like this:

* get a single large CSV file, read it in
* run some transformations on it (simple cleaning/trim/reformatting of timestamps/adding other columns/data type casting, etc)
* write output to parquet files, where I can control the number of files being created

In this case, we want to leave `target_partitions` at its default so it uses all cores for processing the transformations, but I want to control the number of parquet files being written. If I used a `df.repartition()` right before `df.write_parquet()`, it would work fine in this case, if the optimizer wasn't messing with it. (I actually can't think of any other use case for `repartition()` -- if it's not for this, what is it for?)
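For concreteness, here's a minimal sketch of that pipeline using the DataFrame API roughly as it existed around the time of this issue. The `Partitioning` import path and the `write_parquet` signature vary across DataFusion versions, and `input.csv`, `out/`, and the `value` column are hypothetical stand-ins:

```rust
use datafusion::logical_expr::Partitioning;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read the single large CSV; target_partitions stays at its default,
    // so the transformations below still fan out across all cores.
    let df = ctx.read_csv("input.csv", CsvReadOptions::new()).await?;

    // Stand-in for the cleaning/trim/casting transformations
    // ("value" is a hypothetical column).
    let df = df.filter(col("value").is_not_null())?;

    // Collapse to a fixed number of partitions right before writing, so
    // write_parquet emits exactly that many files -- this is the
    // repartition() the optimizer is currently undoing.
    let df = df.repartition(Partitioning::RoundRobinBatch(1))?;

    // Writes one parquet file per partition of the final plan.
    df.write_parquet("out/", None).await?;
    Ok(())
}
```

The point of the sketch is the placement: if the physical optimizer re-inserts a repartition after the explicit `repartition(Partitioning::RoundRobinBatch(1))`, the write step sees more partitions than requested and produces more files than one.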
