ldacey commented on pull request #7921:
URL: https://github.com/apache/arrow/pull/7921#issuecomment-693579613
Do you think it would be possible to add support for repartitioning datasets? I am
facing issues with many small files due to the frequency at which I need to
download data, which is compounded by the partitioning.
I asked about this on Jira as well, but to summarize:
1) I download data every 30 minutes from a source and save it with UUID parquet
filenames (each file contains only the new or updated records since the last
retrieval, so I could not think of a good name for a naming callback to
generate). That is 48 parquet files per day.
2) The data is then partitioned on created_date, which creates even more files
(some of them quite small).
3) When I query the dataset, I need to read in a lot of small files.
I would then want to read the data back and repartition the files using a
callback function, so that the dozens of files in the partition ("date", "==",
"2020-09-15") would become a single consolidated 2020-09-15.parquet file to
keep things tidy. I know I can do this with Spark, but it would be nice to have
a native pyarrow method.
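For what it's worth, here is a minimal sketch of the manual consolidation I
have in mind, assuming a hive-partitioned layout under a hypothetical
base_path and assuming the "date" partition key is read back as a string (all
names here are illustrative, not from the PR):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical layout: base_path/date=2020-09-15/<uuid>.parquet
base_path = "/data/my_dataset"

# Read every small fragment for one partition into a single table.
dataset = ds.dataset(base_path, format="parquet", partitioning="hive")
table = dataset.to_table(filter=ds.field("date") == "2020-09-15")

# Write the partition back out as one consolidated file. The original
# UUID fragments still have to be deleted separately afterwards.
pq.write_table(table, f"{base_path}/date=2020-09-15/2020-09-15.parquet")
```

A native repartition/compaction option in the dataset writer would remove the
need to run this by hand for every partition and clean up the old fragments.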
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]