Joris Van den Bossche created ARROW-16204:
---------------------------------------------

             Summary: [C++][Dataset] Default error existing_data_behaviour for 
writing dataset ignores "part-{i}.ext" files 
                 Key: ARROW-16204
                 URL: https://issues.apache.org/jira/browse/ARROW-16204
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 8.0.0


While trying to understand a failing test in 
https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed 
that the {{write_dataset}} function does not always raise an error by 
default when data already exists in the target location.

The documentation says it will raise "if any data exists in the destination" 
(which is also what I would expect), but in practice it seems to 
ignore certain file names:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
table = pa.table({'a': [1, 2, 3]})

# write a first time to new directory: OK
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write a second time to the same directory: passes, but should raise?
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
>>> !ls test_overwrite
part-0.parquet

# write another time to the same directory with a different name: still passes
>>> ds.write_dataset(table, "test_overwrite", format="parquet",
...                  basename_template="data-{i}.parquet")
>>> !ls test_overwrite
data-0.parquet  part-0.parquet

# now writing again finally raises an error
>>> ds.write_dataset(table, "test_overwrite", format="parquet")
...
ArrowInvalid: Could not write to test_overwrite as the directory is not empty 
and existing_data_behavior is to error
{code}

So it seems that the check for existing data ignores any 
files that match the basename template pattern.
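
To make the suspected behaviour concrete, here is a toy model in plain Python. It is only a sketch of the hypothesis above, not Arrow's actual implementation: the helper names {{template_to_regex}} and {{has_existing_data}} and the {{ignored_template}} parameter are made up for illustration, and it assumes the ignored pattern is the default "part-{i}.parquet". It contrasts a strict check (any file counts as existing data) with a check that skips files matching that pattern.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def template_to_regex(basename_template: str) -> re.Pattern:
    """Hypothetical helper: 'part-{i}.parquet' -> regex for 'part-<digits>.parquet'."""
    prefix, _, suffix = basename_template.partition("{i}")
    return re.compile(re.escape(prefix) + r"\d+" + re.escape(suffix))


def has_existing_data(directory: str, ignored_template: Optional[str] = None) -> bool:
    """Hypothetical check: does `directory` contain any file that counts as data?

    With ignored_template=None every file counts (the documented expectation).
    Otherwise files whose name matches the template are skipped, which is the
    behaviour suspected in this issue (an assumption, not confirmed).
    """
    pattern = template_to_regex(ignored_template) if ignored_template else None
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        if pattern is not None and pattern.fullmatch(path.name):
            continue  # file is ignored by the lenient check
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "part-0.parquet").touch()
    # Strict check: the directory is not empty, so a write should raise.
    print(has_existing_data(tmp))
    # Lenient check: part-0.parquet is skipped, so the write would pass.
    print(has_existing_data(tmp, ignored_template="part-{i}.parquet"))
    Path(tmp, "data-0.parquet").touch()
    # data-0.parquet does not match the ignored pattern, so now it would raise.
    print(has_existing_data(tmp, ignored_template="part-{i}.parquet"))
```

Under this model the sequence of results mirrors the session above: the second write passes, and only a file with a different base name triggers the error.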

cc [~westonpace] do you know if this was intentional? (I would find that a 
strange corner case, and in any case it is also not documented)



