Lance Dacey created ARROW-12365:
-----------------------------------
Summary: [Python] [Dataset] Add partition_filename_cb to
ds.write_dataset()
Key: ARROW-12365
URL: https://issues.apache.org/jira/browse/ARROW-12365
Project: Apache Arrow
Issue Type: Wish
Components: Python
Affects Versions: 3.0.0
Environment: Ubuntu 18.04
Reporter: Lance Dacey
I need to use the legacy pq.write_to_dataset() in order to guarantee that a
file within a partition will have a specific name.
My use case is that I need to report on the final version of the data, and our
visualization tool (Power BI) connects directly to our Parquet files on Azure
Blob.
1) Download data every hour based on an updated_at timestamp (this data is
partitioned by date)
2) Transform the data which was just downloaded and save it into a "staging"
dataset (this holds every version of the data, so there can be many files
within each partition; in this case, up to 24 files within a single date
partition, since we download hourly)
3) Filter the transformed data to a subset of columns, sort it by the
updated_at timestamp, drop duplicates on the unique constraint, then
partition and save it with partition_filename_cb. In the example below, if I
partition by the "date_id" column, my dataset structure will be
"/date_id=20210413/20210413.parquet"
{code:python}
pq.write_to_dataset(
    ...,
    use_legacy_dataset=True,
    partition_filename_cb=lambda x: str(x[-1]) + ".parquet",
){code}
This guarantees that the final dataset has exactly one file per partition and
contains only the latest version of each row (based on the maximum updated_at
timestamp). My visualization tool can then safely connect to and
incrementally refresh from this dataset.
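The sort-and-drop-duplicates part of step 3 can be sketched like this (a pandas illustration; "id" stands in for the hypothetical unique-constraint column, and the data is made up):

```python
# Hedged sketch of step 3's "sort by updated_at, drop duplicates on the
# unique constraint". "id" is an illustrative stand-in for the constraint.
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "updated_at": ["2021-04-13 01:00", "2021-04-13 02:00", "2021-04-13 01:30"],
    "value": ["old", "new", "only"],
})

# Sorting by updated_at first means keep="last" retains the newest row per id.
latest = (
    df.sort_values("updated_at")
      .drop_duplicates(subset=["id"], keep="last")
      .sort_values("id")
      .reset_index(drop=True)
)
print(latest["value"].tolist())  # -> ['new', 'only']
```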
An alternative solution would be to allow us to overwrite anything in an
existing partition. I do not care about the file names so much as I want to
ensure that I am fully replacing any data which might already exist in my
partition, and I want to limit the number of physical files.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)