[ 
https://issues.apache.org/jira/browse/ARROW-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196967#comment-17196967
 ] 

Lance Dacey commented on ARROW-9682:
------------------------------------

Excellent. [~jorisvandenbossche] will there be a way to potentially repartition 
datasets? My use case is this:

1) I download data every 30 minutes from a source and save it under UUID 
parquet filenames (each file just contains records that are new or updated 
since the last hour, so I could not think of a good callback function name). 
This is 48 parquet files per day.
2) The data is then partitioned on created_date, which creates even more 
files (some can be quite small).
3) When I query the dataset, I need to read in a lot of very small files.

I would then want to read the data and repartition the files using a callback 
function, so that the dozens of files in partition ("date", "==", 
"2020-09-15") are consolidated into a single file, 2020-09-15.parquet, to keep 
things tidy. I know I can do this with Spark, but it would be nice to have a 
native pyarrow method.

> [Python] Unable to specify the partition style with pq.write_to_dataset
> -----------------------------------------------------------------------
>
>                 Key: ARROW-9682
>                 URL: https://issues.apache.org/jira/browse/ARROW-9682
>             Project: Apache Arrow
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>         Environment: Ubuntu 18.04
> Python 3.7
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: dataset-parquet-write, parquet, parquetWriter
>
> I am able to import and test DirectoryPartitioning but I am not able to 
> figure out a way to write a dataset using this feature. It seems like 
> write_to_dataset defaults to the "hive" style. Is there a way to test this?
> {code:python}
> import pyarrow as pa
> from pyarrow.dataset import DirectoryPartitioning
>
> # Path segments map positionally onto the year/month/day fields.
> partitioning = DirectoryPartitioning(
>     pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
> )
> print(partitioning.parse("/2009/11/3"))
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)