Lance Dacey created ARROW-12365:
-----------------------------------

             Summary: [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
                 Key: ARROW-12365
                 URL: https://issues.apache.org/jira/browse/ARROW-12365
             Project: Apache Arrow
          Issue Type: Wish
          Components: Python
    Affects Versions: 3.0.0
         Environment: Ubuntu 18.04
            Reporter: Lance Dacey
I need to use the legacy pq.write_to_dataset() in order to guarantee that the file within a partition has a specific name. My use case is that I report on the final version of data, and our visualization tool (Power BI) connects directly to our parquet files on Azure Blob.

1) Download data every hour based on an updated_at timestamp (this data is partitioned by date).

2) Transform the data which was just downloaded and save it into a "staging" dataset. This holds every version of the data, so there will be many files within each partition — up to 24 files per date partition, since we download hourly.

3) Read a filtered subset of columns from the transformed data, sort it by the updated_at timestamp, drop duplicates on the unique constraint, then partition and save it with partition_filename_cb. In the example below, partitioning by the "date_id" column gives the dataset structure "/date_id=20210413/20210413.parquet":

{code:python}
use_legacy_dataset=True,
partition_filename_cb=lambda x: str(x[-1]) + ".parquet",
{code}

This guarantees that the final dataset has exactly one file per partition and that it contains only the latest version of each row, based on the maximum updated_at timestamp. My visualization tool can then safely connect to and incrementally refresh from this dataset.

An alternative solution would be to allow us to overwrite anything in an existing partition. I do not care about the file names so much as I want to ensure that I fully replace any data which might already exist in the partition, and I want to limit the number of physical files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)