[jira] [Reopened] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

Joris Van den Bossche (Jira) Mon, 05 Jul 2021 05:53:06 -0700


     [ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joris Van den Bossche reopened ARROW-12365:
-------------------------------------------

> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> ------------------------------------------------------------------
>
>                 Key: ARROW-12365
>                 URL: https://issues.apache.org/jira/browse/ARROW-12365
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: dataset, parquet, python
>             Fix For: 5.0.0
>
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a
> file within a partition will have a specific name.
> My use case is that I need to report on the final version of the data, and
> our visualization tool (Power BI) connects directly to our Parquet files on
> Azure Blob.
> 1) Download data every hour based on an updated_at timestamp (this data is
> partitioned by date).
> 2) Transform the data which was just downloaded and save it into a "staging"
> dataset. This dataset has all versions of the data, and there will be many
> files within each partition: up to 24 files within a single date partition,
> since we download hourly.
> 3) Filter the transformed data and read a subset of columns, sort it by the
> updated_at timestamp and drop duplicates on the unique constraint, then
> partition and save it with partition_filename_cb. In the example below, if I
> partition by the "date_id" column, then my dataset structure will be
> "/date_id=20210413/20210413.parquet":
> {code:java}
>         use_legacy_dataset=True,
>         partition_filename_cb=lambda x: str(x[-1]) + ".parquet",{code}
> Ultimately, I am sure that this final dataset has exactly one file per 
> partition and that I only have the latest version of each row based on the 
> maximum updated_at timestamp. My visualization tool can safely connect to and 
> incrementally refresh from this dataset.
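
To make step 3 concrete, here is a minimal sketch in plain Python (no pyarrow) of the logic described above: keep the latest row per unique key, then derive one file name per partition with the same callback. The sample rows, the "id" key column, and the helper names latest_per_key and plan_files are hypothetical. The sketch also assumes the callback receives a tuple of partition key values, which is what the x[-1] in the snippet above suggests.

```python
from collections import defaultdict

# Hypothetical rows; "id" stands in for the unique constraint.
rows = [
    {"id": 1, "date_id": 20210413, "updated_at": "2021-04-13T01:00:00", "value": "a"},
    {"id": 1, "date_id": 20210413, "updated_at": "2021-04-13T05:00:00", "value": "b"},
    {"id": 2, "date_id": 20210413, "updated_at": "2021-04-13T02:00:00", "value": "c"},
    {"id": 3, "date_id": 20210414, "updated_at": "2021-04-14T03:00:00", "value": "d"},
]


def latest_per_key(rows, key="id", ts="updated_at"):
    """Keep only the row with the maximum timestamp for each key."""
    latest = {}
    # ISO 8601 strings sort chronologically, so later versions overwrite earlier ones.
    for row in sorted(rows, key=lambda r: r[ts]):
        latest[row[key]] = row
    return list(latest.values())


# Same callback as in the report; assumed to receive a tuple of partition keys.
partition_filename_cb = lambda x: str(x[-1]) + ".parquet"


def plan_files(rows, partition_col="date_id"):
    """Map each partition's single relative file path to the rows it would hold."""
    plan = defaultdict(list)
    for row in rows:
        keys = (row[partition_col],)
        path = f"{partition_col}={row[partition_col]}/{partition_filename_cb(keys)}"
        plan[path].append(row)
    return dict(plan)


final = latest_per_key(rows)
files = plan_files(final)
for path, part_rows in sorted(files.items()):
    print(path, len(part_rows))
```

Because the file name depends only on the partition key, rerunning the job for the same date targets the same path, which is what keeps the final dataset at one file per partition.
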
>
> An alternative solution would be to allow us to overwrite anything in an 
> existing partition. I do not care about the file names so much as I want to 
> ensure that I am fully replacing any data which might already exist in my 
> partition, and I want to limit the number of physical files.
>  
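
The alternative described above (fully replacing whatever already exists in a partition) could be sketched like this. It is only an illustration using the standard library, with JSON files standing in for Parquet. The overwrite_partition helper and the file names are hypothetical and not part of any pyarrow API.

```python
import json
import tempfile
from pathlib import Path


def overwrite_partition(root, partition_dir, rows, filename="data.json"):
    """Delete every file already in the partition, then write exactly one file."""
    target = Path(root) / partition_dir
    target.mkdir(parents=True, exist_ok=True)
    for old in target.iterdir():
        if old.is_file():
            old.unlink()  # drop stale data from earlier runs
    out = target / filename
    out.write_text(json.dumps(rows))
    return out


with tempfile.TemporaryDirectory() as root:
    part = "date_id=20210413"
    # First run writes one file; the second run uses a different name on purpose.
    overwrite_partition(root, part, [{"id": 1}], filename="part-0.json")
    overwrite_partition(root, part, [{"id": 1}, {"id": 2}], filename="part-1.json")
    remaining = sorted(p.name for p in (Path(root) / part).iterdir())
    print(remaining)
```

With this approach the file name no longer matters: the partition holds a single file after each run regardless of what the writer chose to call it.
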



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
