[ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reopened ARROW-12365:
-------------------------------------------

> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> ------------------------------------------------------------------
>
>                 Key: ARROW-12365
>                 URL: https://issues.apache.org/jira/browse/ARROW-12365
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 3.0.0
>         Environment: Ubuntu 18.04
>            Reporter: Lance Dacey
>            Priority: Major
>              Labels: dataset, parquet, python
>             Fix For: 5.0.0
>
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a
> file within a partition will have a specific name.
> My use case is that I need to report on the final version of data, and our
> visualization tool (Power BI) connects directly to our parquet files on
> Azure Blob.
> 1) Download data every hour based on an updated_at timestamp (this data is
> partitioned by date)
> 2) Transform the data which was just downloaded and save it into a "staging"
> dataset (this has all versions of the data and there will be many files
> within each partition. In this case, up to 24 files within a single date
> partition since we download hourly)
> 3) Filter the transformed data and read a subset of columns, sort it by the
> updated_at timestamp and drop duplicates on the unique constraint, then
> partition and save it with partition_filename_cb. In the example below, if I
> partition by the "date_id" column, then my dataset structure will be
> "/date_id=20210413/20210413.parquet"
> {code:java}
> use_legacy_dataset=True,
> partition_filename_cb=lambda x: str(x[-1]) + ".parquet",{code}
> Ultimately, I am sure that this final dataset has exactly one file per
> partition and that I only have the latest version of each row based on the
> maximum updated_at timestamp. My visualization tool can safely connect to and
> incrementally refresh from this dataset.
>
>
> An alternative solution would be to allow us to overwrite anything in an
> existing partition. I do not care about the file names so much as I want to
> ensure that I am fully replacing any data which might already exist in my
> partition, and I want to limit the number of physical files.
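
For readers following the workflow in the description, step 3 (keep only the latest version of each row, then write exactly one predictably named file per partition) can be sketched in plain Python. This is only an illustration of the logic, not the pyarrow implementation: it does not call pyarrow or write any files, and the function name latest_per_partition, the "id" and "status" columns, and the sample rows are hypothetical. Only the filename rule is taken from the lambda quoted in the issue.

```python
from collections import defaultdict


def partition_filename_cb(keys):
    # Same rule as the lambda in the issue: name the file after the
    # last partition key value.
    return str(keys[-1]) + ".parquet"


def latest_per_partition(rows, unique_key="id", partition_col="date_id"):
    """Map each partition's single file path to its deduplicated rows.

    `unique_key` and `partition_col` defaults are assumptions made for
    this sketch; the issue only names "date_id" and "updated_at".
    """
    latest = {}
    # Sort ascending by updated_at so later versions overwrite earlier ones.
    for row in sorted(rows, key=lambda r: r["updated_at"]):
        latest[row[unique_key]] = row

    files = defaultdict(list)
    for row in latest.values():
        keys = (row[partition_col],)
        path = f"{partition_col}={row[partition_col]}/{partition_filename_cb(keys)}"
        files[path].append(row)
    return dict(files)


# Hypothetical sample data: row 1 was downloaded twice on the same date.
rows = [
    {"id": 1, "date_id": 20210413, "updated_at": "2021-04-13T01:00:00", "status": "open"},
    {"id": 1, "date_id": 20210413, "updated_at": "2021-04-13T05:00:00", "status": "closed"},
    {"id": 2, "date_id": 20210413, "updated_at": "2021-04-13T02:00:00", "status": "open"},
    {"id": 3, "date_id": 20210414, "updated_at": "2021-04-14T00:30:00", "status": "open"},
]

files = latest_per_partition(rows)
for path, kept in sorted(files.items()):
    print(path, [r["id"] for r in kept])
```

Because the file name depends only on the partition key, rerunning the job for the same date targets the same path, which is what makes the "one file per partition" guarantee possible when the writer replaces that file.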