[
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398007#comment-17398007
]
Lance Dacey commented on ARROW-12358:
-------------------------------------
What is the common workflow pattern for folks trying to emulate something
like a view in a database?
In many of my sources I have a dataset which is append-only (using UUIDs in the
basename template), normally partitioned by date. If this data is downloaded
frequently or is generated from multiple sources (for example, several
endpoints or servers), then each partition might have many files. Most likely
there are also different versions of each row (one ID will have a row for each
time it was updated, for example).
I then write to a new dataset which is used for reporting and visualization.
# Get the list of files which were saved to the append-only dataset during the
most recent scheduled run
# Create a dataset from the list of paths which were just saved, then use
.get_fragments() and ds._get_partition_keys(fragment.partition_expression) to
generate a filter expression (this allows me to query *all* of the data in
each partition which was recently modified - so even if only a single row
changed in the 2021-08-05 partition, I still need to read all of the other
data in that partition in order to finalize it)
# Convert to a pandas dataframe, sort the data and drop duplicates on a
primary key, then convert back to a table (it would be nice to be able to do
this purely with a pyarrow table so I could leave out pandas!)
# Use pq.write_to_dataset() with partition_filename_cb=lambda x: str(x[-1]) +
".parquet" to write to a final dataset. Because the filenames are
deterministic, this lets me overwrite the relevant partitions, so I can be
certain that I only have the latest version of each row. (A rough sketch of
these steps follows this list.)
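Roughly, steps 2-4 look something like this sketch (the paths and the
id / updated_at / date column names are placeholders for whatever the real
dataset uses):
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Placeholder: the files written to the append-only dataset during the most
# recent scheduled run (step 1).
new_files = [
    "append_only/date=2021-08-05/c0ffee.parquet",
    "append_only/date=2021-08-06/deadbeef.parquet",
]

# Step 2: build a dataset from just those paths and turn the partitions they
# touch into a single filter expression.
new_data = ds.dataset(
    new_files,
    format="parquet",
    partitioning="hive",
    partition_base_dir="append_only",
)

filter_expr = None
for fragment in new_data.get_fragments():
    # e.g. {"date": "2021-08-05"}; _get_partition_keys() is a private helper
    keys = ds._get_partition_keys(fragment.partition_expression)
    part_expr = None
    for name, value in keys.items():
        term = ds.field(name) == value
        part_expr = term if part_expr is None else part_expr & term
    filter_expr = part_expr if filter_expr is None else filter_expr | part_expr

# Read *all* rows from the affected partitions of the append-only dataset.
append_only = ds.dataset("append_only", format="parquet", partitioning="hive")
table = append_only.to_table(filter=filter_expr)

# Step 3: deduplicate via pandas, keeping only the latest version of each row.
df = (
    table.to_pandas()
    .sort_values("updated_at", ascending=False)
    .drop_duplicates(subset=["id"], keep="first")
)
table = pa.Table.from_pandas(df, preserve_index=False)

# Step 4: rewrite the affected partitions of the final dataset. One fixed file
# name per partition means each rewrite replaces the previous file.
# partition_filename_cb is only supported by the legacy write path.
pq.write_to_dataset(
    table,
    root_path="final",
    partition_cols=["date"],
    partition_filename_cb=lambda partition_keys: str(partition_keys[-1]) + ".parquet",
    use_legacy_dataset=True,
)
{code}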
This is my approach to come close to what I would achieve with a view in a
database. It works fine, but storage is essentially doubled since I am
maintaining two datasets (append-only and final). Our visualization tool
connects directly to these parquet files, so there is also some benefit in
having fewer files (one per partition instead of potentially hundreds).
> [C++][Python][R][Dataset] Control overwriting vs appending when writing to
> existing dataset
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
> Fix For: 6.0.0
>
>
> Currently, dataset writing (e.g. with {{pyarrow.dataset.write_dataset}})
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when
> you are writing to an existing dataset, you de facto overwrite previous data
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by
> ensuring the file names are unique (the user can specify the
> {{basename_template}} to be something unique; a sketch of that workaround
> follows below). There is also ARROW-7706 about
> silently doubling data (so _not_ overwriting existing data) with the legacy
> {{parquet.write_to_dataset}} implementation.
> It could be good to have a "mode" when writing datasets that controls the
> different possible behaviours. Erroring when there is pre-existing data in
> the target directory is maybe the safest default, because both silently
> appending and silently overwriting can be surprising behaviour depending on
> your expectations.
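> A minimal sketch of that {{basename_template}} workaround (the table and the
> target directory below are placeholders):
> {code:python}
> import uuid
>
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> table = pa.table({"x": [1, 2, 3]})
>
> # A per-write UUID in basename_template keeps new files from clobbering the
> # "part-{i}" files left behind by an earlier write_dataset call to the same
> # directory.
> ds.write_dataset(
>     table,
>     "existing_dataset_dir",
>     format="parquet",
>     basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
> )
> {code}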
--
This message was sent by Atlassian Jira
(v8.3.4#803005)