ldacey commented on pull request #7921:
URL: https://github.com/apache/arrow/pull/7921#issuecomment-693579613
Do you think it would be possible to add support for repartitioning datasets? I am
facing issues with many small files due to the frequency at which I need to
download data, which is compounded by the partitioning.
I asked about this on Jira as well, but to summarize:
1) I download data every 30 minutes from a source and save it with UUID parquet
filenames (each file contains only the new or updated records since the last
retrieval, so I could not think of a good name for a naming callback to
generate). That is 48 parquet files per day.
2) The data is then partitioned on created_date, which creates even more files
(some of them quite small).
3) When I query the dataset, I need to read in a lot of small files.
I would then want to read the data back and repartition the files using a
callback function, so that the dozens of files in the partition ("date", "==",
"2020-09-15") would become a single consolidated 2020-09-15.parquet file to
keep things tidy. I know I can do this with Spark, but it would be nice to have
a native pyarrow method.
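For what it's worth, here is a minimal sketch of the manual consolidation I
have in mind, assuming a hive-partitioned layout under a hypothetical
base_path and assuming the "date" partition key is read back as a string (all
names here are illustrative, not from the PR):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical layout: base_path/date=2020-09-15/<uuid>.parquet
base_path = "/data/my_dataset"

# Read every small fragment for one partition into a single table.
dataset = ds.dataset(base_path, format="parquet", partitioning="hive")
table = dataset.to_table(filter=ds.field("date") == "2020-09-15")

# Write the partition back out as one consolidated file. The original
# UUID fragments still have to be deleted separately afterwards.
pq.write_table(table, f"{base_path}/date=2020-09-15/2020-09-15.parquet")
```

A native repartition/compaction option in the dataset writer would remove the
need to run this by hand for every partition and clean up the old fragments.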
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]