Joris Van den Bossche created ARROW-12358:
---------------------------------------------

             Summary: [C++][Python][R][Dataset] Control overwriting vs 
appending when writing to existing dataset
                 Key: ARROW-12358
                 URL: https://issues.apache.org/jira/browse/ARROW-12358
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 5.0.0


Currently, dataset writing (e.g. with {{pyarrow.dataset.write_dataset}}) uses 
a fixed filename template ({{"part\{i\}.ext"}}). This means that when you 
write to an existing dataset with this default template, you de facto overwrite 
previous data, because the new files reuse the same names.

There is some discussion in ARROW-10695 about how the user can avoid this by 
ensuring the file names are unique (the user can specify the 
{{basename_template}} to be something unique). There is also ARROW-7706 about 
silently doubling data (so _not_ overwriting existing data) with the legacy 
{{parquet.write_to_dataset}} implementation. 
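
As a rough illustration of the unique-file-name workaround, the sketch below builds a {{basename_template}} with a random UUID in it, so repeated writes to the same directory should not collide. The helper names {{unique_basename_template}} and {{append_to_dataset}} are hypothetical and not part of pyarrow; only {{write_dataset}} and its {{basename_template}} argument come from the library.

```python
import uuid


def unique_basename_template(extension: str = "parquet") -> str:
    """Return a template such as 'part-<hex>-{i}.parquet'.

    The literal "{i}" placeholder is kept, because write_dataset
    substitutes the per-file counter for it.
    """
    return f"part-{uuid.uuid4().hex}-{{i}}.{extension}"


def append_to_dataset(table, base_dir: str) -> None:
    """Write `table` under `base_dir` using file names unique to this call.

    Hypothetical helper: it only wraps write_dataset with a unique template.
    """
    import pyarrow.dataset as ds  # imported lazily; requires pyarrow

    ds.write_dataset(
        table,
        base_dir,
        format="parquet",
        basename_template=unique_basename_template("parquet"),
    )
```

Note that this only gives append-like behaviour; it does nothing to detect or remove files that are already in the target directory.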

It could be good to have a "mode" when writing datasets that controls the 
different possible behaviours. Raising an error when there is pre-existing data 
in the target directory is maybe the safest default, because silently appending 
and silently overwriting can both be surprising, depending on your expectations.
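
To make the proposal a bit more concrete, here is a minimal sketch of what such a mode could do to the target directory before any files are written. Everything in it is an assumption for illustration: the function name {{prepare_target_dir}} and the mode names {{"error"}}, {{"overwrite"}} and {{"append"}} are hypothetical, and it uses only the local filesystem rather than Arrow's filesystem abstraction.

```python
import os
import shutil


def prepare_target_dir(base_dir: str, mode: str = "error") -> None:
    """Prepare `base_dir` according to a hypothetical write mode.

    "error":     raise if base_dir already contains anything (safest default)
    "overwrite": delete the existing contents before writing
    "append":    keep the existing contents; the caller must use unique names
    """
    if mode not in ("error", "overwrite", "append"):
        raise ValueError(f"unknown mode: {mode!r}")

    has_data = os.path.isdir(base_dir) and bool(os.listdir(base_dir))

    if has_data and mode == "error":
        raise FileExistsError(f"{base_dir!r} already contains data")
    if has_data and mode == "overwrite":
        shutil.rmtree(base_dir)

    # Make sure the directory exists for the write that follows.
    os.makedirs(base_dir, exist_ok=True)
```

With {{"error"}} as the default, the user has to opt in explicitly to either of the two behaviours that change what an existing dataset contains.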




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
