[ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335292#comment-17335292 ]
Lance Dacey commented on ARROW-12365:
-------------------------------------
@jorisvandenbossche I will close this issue in favor of an overwrite option for
partitions, since that is the only reason I use partition_filename_cb:
https://issues.apache.org/jira/browse/ARROW-12358
> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> ------------------------------------------------------------------
>
> Key: ARROW-12365
> URL: https://issues.apache.org/jira/browse/ARROW-12365
> Project: Apache Arrow
> Issue Type: Wish
> Components: Python
> Affects Versions: 3.0.0
> Environment: Ubuntu 18.04
> Reporter: Lance Dacey
> Priority: Major
> Labels: dataset, parquet, python
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a
> file within a partition will have a specific name.
> My use case is that I need to report on the final version of our data, and
> our visualization tool (Power BI) connects directly to our Parquet files on
> Azure Blob.
> 1) Download data every hour based on an updated_at timestamp (this data is
> partitioned by date)
> 2) Transform the data which was just downloaded and save it into a "staging"
> dataset (this holds every version of the data, so there will be many files
> within each partition; up to 24 files within a single date partition, since
> we download hourly)
> 3) Filter the transformed data and read a subset of columns, sort it by the
> updated_at timestamp and drop duplicates on the unique constraint, then
> partition and save it with partition_filename_cb. In the example below, if I
> partition by the "date_id" column, then my dataset structure will be
> "/date_id=20210413/20210413.parquet"
> {code:python}
> use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]) + ".parquet",
> {code}
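>
> To make step 3 concrete, here is a minimal sketch of the whole
> read-deduplicate-write pass. The paths, column names, and unique-key column
> ("staging", "final", "id", "value") are hypothetical stand-ins for the real
> ones:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> # Read only the needed columns, filtered to the partition being refreshed
> staging = ds.dataset("staging", format="parquet", partitioning="hive")
> table = staging.to_table(
>     columns=["id", "updated_at", "value", "date_id"],
>     filter=ds.field("date_id") == 20210413,
> )
>
> # Keep only the latest version of each row: sort by updated_at, then drop
> # duplicates on the unique constraint
> df = table.to_pandas()
> df = df.sort_values("updated_at").drop_duplicates(subset=["id"], keep="last")
>
> # Write a single file per partition, named after the partition value, so the
> # final dataset looks like /date_id=20210413/20210413.parquet
> pq.write_to_dataset(
>     pa.Table.from_pandas(df, preserve_index=False),
>     root_path="final",
>     partition_cols=["date_id"],
>     use_legacy_dataset=True,
>     partition_filename_cb=lambda keys: str(keys[-1]) + ".parquet",
> )
> {code}
>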
> Ultimately, this guarantees that the final dataset has exactly one file per
> partition and that it contains only the latest version of each row, based on
> the maximum updated_at timestamp. My visualization tool can safely connect to
> and incrementally refresh from this dataset.
>
>
> An alternative solution would be an option to overwrite everything in an
> existing partition. I do not care about the file names so much as I want to
> ensure that I am fully replacing any data which might already exist in the
> partition, and I want to limit the number of physical files.
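>
> For reference, a sketch of what such an overwrite option could look like.
> This assumes the existing_data_behavior and basename_template parameters that
> ds.write_dataset gained in later pyarrow releases (6.0+); the table and path
> here are hypothetical:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> table = pa.table({"id": [1], "updated_at": ["2021-04-13 08:00:00"],
>                   "value": [42], "date_id": [20210413]})
>
> # delete_matching removes any files already present in the partitions being
> # written, so each touched partition holds only this write's output
> ds.write_dataset(
>     table,
>     base_dir="final",
>     format="parquet",
>     partitioning=ds.partitioning(
>         pa.schema([("date_id", pa.int64())]), flavor="hive"
>     ),
>     basename_template="part-{i}.parquet",
>     existing_data_behavior="delete_matching",
> )
> {code}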
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)