Joris Van den Bossche created ARROW-12358:
---------------------------------------------
Summary: [C++][Python][R][Dataset] Control overwriting vs
appending when writing to existing dataset
Key: ARROW-12358
URL: https://issues.apache.org/jira/browse/ARROW-12358
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Fix For: 5.0.0
Currently, dataset writing (e.g. with {{pyarrow.dataset.write_dataset}}) uses
a fixed filename template ({{"part\{i\}.ext"}}). This means that when you
write to an existing dataset with this default template, you de facto
overwrite the previous data.
There is some discussion in ARROW-10695 about how the user can avoid this by
ensuring the file names are unique (the user can specify the
{{basename_template}} to be something unique). There is also ARROW-7706 about
silently doubling data (so _not_ overwriting existing data) with the legacy
{{parquet.write_to_dataset}} implementation.
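As an illustration of the unique-name workaround, here is a minimal sketch that builds a fresh {{basename_template}} for each write. The helper name {{unique_basename_template}} and the {{part-<token>-\{i\}}} naming scheme are assumptions made for this example, not part of Arrow. The only requirement it relies on is that the template keeps the {{\{i\}}} placeholder.

```python
import uuid


def unique_basename_template(extension: str = "parquet") -> str:
    """Return a basename template that is unique per call.

    Hypothetical helper: a random token is embedded in the name so files
    from separate writes do not collide. The literal "{i}" placeholder is
    kept so the writer can still number the files of a single write.
    """
    token = uuid.uuid4().hex
    return f"part-{token}-{{i}}.{extension}"


# Possible usage (assumes pyarrow is installed and `table` is a pyarrow.Table):
#
#   import pyarrow.dataset as ds
#   ds.write_dataset(table, "path/to/dataset", format="parquet",
#                    basename_template=unique_basename_template())
```

With a template like this, each write adds new files next to the existing ones, so the effect is an append rather than an overwrite.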
It could be good to have a "mode" when writing datasets that controls the
different possible behaviours. Erroring when there is pre-existing data in
the target directory is probably the safest default, because silently appending
and silently overwriting can both be surprising, depending on your expectations.
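To make the proposal concrete, the following is a rough sketch of how such a "mode" could behave, written as a pre-write check on the target directory. The function {{prepare_target_dir}} and the mode names {{"error"}}, {{"overwrite"}} and {{"append"}} are hypothetical and chosen only for illustration. They are not an existing Arrow API, and an actual implementation could differ.

```python
import shutil
from pathlib import Path

VALID_MODES = ("error", "overwrite", "append")  # hypothetical mode names


def prepare_target_dir(base_dir, mode="error"):
    """Apply the chosen mode to base_dir before a dataset write.

    - "error":     refuse to write if the directory already contains anything
    - "overwrite": delete the existing contents, then start from empty
    - "append":    keep existing files (the caller must use unique file names)
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    path = Path(base_dir)
    has_data = path.is_dir() and any(path.iterdir())

    if has_data and mode == "error":
        raise FileExistsError(f"{path} already contains data")
    if has_data and mode == "overwrite":
        shutil.rmtree(path)  # drops everything under base_dir

    path.mkdir(parents=True, exist_ok=True)
    return path
```

Defaulting to {{"error"}} matches the reasoning above: the user has to opt in explicitly to either appending or overwriting.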