Lance Dacey created ARROW-12365:
-----------------------------------
Summary: [Python] [Dataset] Add partition_filename_cb to
ds.write_dataset()
Key: ARROW-12365
URL: https://issues.apache.org/jira/browse/ARROW-12365
Project: Apache Arrow
Issue Type: Wish
Components: Python
Affects Versions: 3.0.0
Environment: Ubuntu 18.04
Reporter: Lance Dacey
I need to use the legacy pq.write_to_dataset() in order to guarantee that a
file within a partition will have a specific name.
My use case is that I need to report on the final version of the data, and our
visualization tool (Power BI) connects directly to our Parquet files on Azure
Blob.
1) Download data every hour based on an updated_at timestamp (this data is
partitioned by date)
2) Transform the data which was just downloaded and save it into a "staging"
dataset (this holds every version of the data, so there can be many files
within each partition; in this case, up to 24 files within a single date
partition, since we download hourly)
3) Filter the transformed data to a subset of columns, sort it by the
updated_at timestamp, drop duplicates on the unique constraint, then
partition and save it with partition_filename_cb. In the example below, if I
partition by the "date_id" column, my dataset structure will be
"/date_id=20210413/20210413.parquet"
{code:python}
pq.write_to_dataset(
    ...,
    use_legacy_dataset=True,
    partition_filename_cb=lambda x: str(x[-1]) + ".parquet",
){code}
This guarantees that the final dataset has exactly one file per partition and
contains only the latest version of each row (based on the maximum updated_at
timestamp). My visualization tool can then safely connect to and
incrementally refresh from this dataset.
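The sort-and-drop-duplicates part of step 3 can be sketched like this (a pandas illustration; "id" stands in for the hypothetical unique-constraint column, and the data is made up):

```python
# Hedged sketch of step 3's "sort by updated_at, drop duplicates on the
# unique constraint". "id" is an illustrative stand-in for the constraint.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "updated_at": ["2021-04-13 01:00", "2021-04-13 02:00", "2021-04-13 01:30"],
    "value": ["old", "new", "only"],
})

# Sorting by updated_at first means keep="last" retains the newest row per id.
latest = (
    df.sort_values("updated_at")
      .drop_duplicates(subset=["id"], keep="last")
      .sort_values("id")
      .reset_index(drop=True)
)
print(latest["value"].tolist())  # -> ['new', 'only']
```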
An alternative solution would be to allow us to overwrite anything in an
existing partition. I do not care about the file names so much as I want to
ensure that I am fully replacing any data which might already exist in my
partition, and I want to limit the number of physical files.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)