[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

Weston Pace (Jira) Mon, 17 May 2021 10:38:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346298#comment-17346298
 ]


Weston Pace commented on ARROW-12358:
-------------------------------------

Looking at this with fresh eyes, the "overwrite mode" feature is fairly 
different from an "update" feature, so I don't think update-related topics are 
relevant to this ticket.  Update generally (and specifically in [~ldacey]'s 
case) implies reading from and writing to the same set of files, which 
overwrite-partition mode wouldn't allow.  Overwrite-partition mode could be 
useful in limited circumstances (e.g. someone regenerates an entire new set of 
data for one or more partitions), but I think those cases are rare enough, and 
would be handled by a general "update" feature anyway, that I don't see much 
benefit in creating a separate feature; the added complexity would just 
confuse users.

 

So I'll walk back my earlier comment.  I'd now argue that dataset write should 
only allow "append" and "error" options.
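
To make the two proposed options concrete, here is a minimal sketch of a 
pre-write check.  The function name {{check_write_target}} and the {{mode}} 
argument are hypothetical and are not part of any existing Arrow API; the 
sketch only illustrates the intended semantics of "append" and "error".

```python
from pathlib import Path


def check_write_target(base_dir, mode="error"):
    """Hypothetical pre-write guard; not part of the Arrow API.

    mode="error":  refuse to write if base_dir already contains files.
    mode="append": allow the write; existing files are left in place.
    Returns True if base_dir already held data, False otherwise.
    """
    if mode not in ("append", "error"):
        raise ValueError(f"unsupported mode: {mode!r}")
    root = Path(base_dir)
    # Any regular file below the target directory counts as existing data.
    has_data = root.is_dir() and any(p.is_file() for p in root.rglob("*"))
    if mode == "error" and has_data:
        raise FileExistsError(f"{root} already contains data")
    return has_data
```

With "error" as the default, a caller has to opt in to "append" explicitly 
before any files are added next to existing ones.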

 

Dataset update could be created as a separate Jira ticket (I'll go ahead and 
draft one).  Dataset update would mean scanning and rewriting a dataset (or 
parts thereof).

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12358
>                 URL: https://issues.apache.org/jira/browse/ARROW-12358
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>             Fix For: 5.0.0
>
>
> Currently, dataset writing (e.g. with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you write to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. Erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because silently 
> appending and silently overwriting can both be surprising behaviour 
> depending on your expectations.
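
The filename collision described above, and the unique-template workaround, 
can be sketched without touching any files.  The helper 
{{unique_basename_template}} is hypothetical, and the fixed template string 
below is only a stand-in for the writer's default.

```python
import uuid


def unique_basename_template(ext="parquet"):
    """Return a basename template that is unique per call (hypothetical helper).

    The "{i}" placeholder is kept literally so that the dataset writer can
    substitute its own file counter.
    """
    return f"part-{uuid.uuid4().hex}-{{i}}.{ext}"


fixed = "part-{i}.parquet"  # stand-in for a fixed default template
first = unique_basename_template()
second = unique_basename_template()

# Two writes with the same fixed template produce the same file name,
# so the second write replaces the first file.
assert fixed.format(i=0) == fixed.format(i=0)

# Two writes with unique templates produce different file names,
# so the second write adds files instead of replacing them.
assert first.format(i=0) != second.format(i=0)
```

The returned string would be passed as the {{basename_template}} argument 
mentioned above.  Note that this turns every write into an append, which is 
the other silent behaviour the description warns about.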



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
