[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320191#comment-17320191
 ] 

Joris Van den Bossche edited comment on ARROW-12358 at 4/13/21, 1:57 PM:
-------------------------------------------------------------------------

As mentioned by [~ldacey] in 
[ARROW-10695|https://issues.apache.org/jira/browse/ARROW-10695?focusedCommentId=17320167&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17320167]
 (and also in the comment on my [SO 
answer|https://stackoverflow.com/questions/67071323/how-to-control-whether-pyarrow-dataset-write-dataset-will-overwrite-previous-dat/67074697#67074697]),
 one of the consequences of the current default behaviour is that it will 
sometimes overwrite and sometimes append data, depending on what files are 
already present and how many parts you are writing.  
It would probably be useful to be able to either fully overwrite or always 
append. 
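
As a sketch of the workaround discussed in ARROW-10695, one can make the basename unique per write so that repeated writes never collide (the helper name below is illustrative, not part of the pyarrow API):

```python
import uuid

def unique_basename_template():
    # Hypothetical helper: embed a fresh UUID in the template so each
    # write_dataset call produces non-colliding file names; "{i}" is
    # filled in by the dataset writer with the per-file part counter.
    return f"part-{uuid.uuid4().hex}-{{i}}.parquet"

template = unique_basename_template()
# Sketch of the corresponding pyarrow call (requires pyarrow):
# import pyarrow.dataset as ds
# ds.write_dataset(table, "path/to/dataset", format="parquet",
#                  basename_template=template)
```

With a unique template, the default behaviour effectively becomes "append"; without it, matching file names are silently overwritten.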

Taking inspiration for possible "modes" from ARROW-7706:

* {{"overwrite"}}: overwrite existing data
** But should it clear all existing data first, or only overwrite files when 
file names match (i.e. basically the current behaviour)?
** Both behaviours might actually be useful depending on your use case?

* {{"append"}}: append new data to existing data
** But can we do this automatically with the default filename template? Because 
then if there are already {{part-0.parquet}} and {{part-1.parquet}} files 
present in a certain partition, should it automatically infer the "current 
counter" to write a {{part-2.parquet}}? (that seems rather complicated, 
especially if the max counter varies across partitions)

* {{"error"}}: raise an error if data already exists
** It can check if the specified base directory is empty or not (it's probably 
fine to have the directory itself already exist, as long as it is empty), and 
error if not empty.

In addition, it is mentioned that Spark also has an {{"ignore"}} option 
(silently ignore the write operation if data already exists), but it is not 
clear whether this is important enough to add.





> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12358
>                 URL: https://issues.apache.org/jira/browse/ARROW-12358
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>             Fix For: 5.0.0
>
>
> Currently, the dataset writing (e.g. with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
