[
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398007#comment-17398007
]
Lance Dacey commented on ARROW-12358:
-------------------------------------
What is the common workflow pattern for folks trying to emulate something
like a view in a database?
In many of my sources I have a dataset which is append-only (using UUIDs in the
basename template), normally partitioned by date. If this data is downloaded
frequently or is generated from multiple sources (for example, several
endpoints or servers), then each partition might have many files. Most likely
there are also different versions of each row (one ID will have a row for each
time it was updated, for example).
I then write to a new dataset which is used for reporting and visualization.
# Get the list of files which were saved to the append-only dataset during the
most recent scheduled run
# Create a dataset from the list of paths which were just saved, then use
.get_fragments() and ds._get_partition_keys(fragment.partition_expression) to
generate a filter expression (this allows me to query *all* of the data in
each partition which was recently modified - so even if only a single row
changed in the 2021-08-05 partition, I still need to read all of the other
data in that partition in order to finalize it)
# Convert to a pandas dataframe, sort the data and drop duplicates on a
primary key, then convert back to a table (it would be nice to be able to do
this purely with a pyarrow table so I could leave out pandas!)
# Use pq.write_to_dataset() with partition_filename_cb=lambda x: str(x[-1]) +
".parquet" to write to a final dataset. Because the filenames are
deterministic, this lets me overwrite the relevant partitions, so I can be
certain that I only have the latest version of each row. (A rough sketch of
these steps follows this list.)
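Roughly, steps 2-4 look something like this sketch (the paths and the
id / updated_at / date column names are placeholders for whatever the real
dataset uses):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Placeholder: the files written to the append-only dataset during the most
# recent scheduled run (step 1).
new_files = [
    "append_only/date=2021-08-05/c0ffee.parquet",
    "append_only/date=2021-08-06/deadbeef.parquet",
]

# Step 2: build a dataset from just those paths and turn the partitions they
# touch into a single filter expression.
new_data = ds.dataset(
    new_files,
    format="parquet",
    partitioning="hive",
    partition_base_dir="append_only",
)

filter_expr = None
for fragment in new_data.get_fragments():
    # e.g. {"date": "2021-08-05"}; _get_partition_keys() is a private helper
    keys = ds._get_partition_keys(fragment.partition_expression)
    part_expr = None
    for name, value in keys.items():
        term = ds.field(name) == value
        part_expr = term if part_expr is None else part_expr & term
    filter_expr = part_expr if filter_expr is None else filter_expr | part_expr

# Read *all* rows from the affected partitions of the append-only dataset.
append_only = ds.dataset("append_only", format="parquet", partitioning="hive")
table = append_only.to_table(filter=filter_expr)

# Step 3: deduplicate via pandas, keeping only the latest version of each row.
df = (
    table.to_pandas()
    .sort_values("updated_at", ascending=False)
    .drop_duplicates(subset=["id"], keep="first")
)
table = pa.Table.from_pandas(df, preserve_index=False)

# Step 4: rewrite the affected partitions of the final dataset. One fixed file
# name per partition means each rewrite replaces the previous file.
# partition_filename_cb is only supported by the legacy write path.
pq.write_to_dataset(
    table,
    root_path="final",
    partition_cols=["date"],
    partition_filename_cb=lambda partition_keys: str(partition_keys[-1]) + ".parquet",
    use_legacy_dataset=True,
)
{code}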
This is my approach to come close to what I would achieve with a view in a
database. It works fine, but storage is essentially doubled since I am
maintaining two datasets (append-only and final). Our visualization tool
connects directly to these parquet files, so there is also some benefit in
having fewer files (one per partition instead of potentially hundreds).
> [C++][Python][R][Dataset] Control overwriting vs appending when writing to
> existing dataset
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
> Fix For: 6.0.0
>
>
> Currently, dataset writing (e.g. with {{pyarrow.dataset.write_dataset}})
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when
> you are writing to an existing dataset, you de facto overwrite previous data
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by
> ensuring the file names are unique (the user can specify the
> {{basename_template}} to be something unique; a sketch of that workaround
> follows below). There is also ARROW-7706 about
> silently doubling data (so _not_ overwriting existing data) with the legacy
> {{parquet.write_to_dataset}} implementation.
> It could be good to have a "mode" when writing datasets that controls the
> different possible behaviours. Erroring when there is pre-existing data in
> the target directory is maybe the safest default, because both silently
> appending and silently overwriting can be surprising behaviour depending on
> your expectations.
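> A minimal sketch of that {{basename_template}} workaround (the table and the
> target directory below are placeholders):
> {code:python}
> import uuid
>
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> table = pa.table({"x": [1, 2, 3]})
>
> # A per-write UUID in basename_template keeps new files from clobbering the
> # "part-{i}" files left behind by an earlier write_dataset call to the same
> # directory.
> ds.write_dataset(
>     table,
>     "existing_dataset_dir",
>     format="parquet",
>     basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
> )
> {code}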
--
This message was sent by Atlassian Jira
(v8.3.4#803005)