[
https://issues.apache.org/jira/browse/ARROW-16204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche resolved ARROW-16204.
-------------------------------------------
Resolution: Fixed
Issue resolved by pull request 12898
[https://github.com/apache/arrow/pull/12898]
> [C++][Dataset] Default error existing_data_behaviour for writing dataset
> ignores a single file
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-16204
> URL: https://issues.apache.org/jira/browse/ARROW-16204
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> While trying to understand a failing test in
> https://github.com/apache/arrow/pull/12811#discussion_r851128672, I noticed
> that the {{write_dataset}} function does not always raise an error by default
> when data already exists in the target location.
> The documentation says it will raise "if any data exists in the destination"
> (which is also what I would expect), but in practice it seems to ignore
> certain file names:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> table = pa.table({'a': [1, 2, 3]})
> # write a first time to new directory: OK
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> >>> !ls test_overwrite
> part-0.parquet
> # write a second time to the same directory: passes, but should raise?
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> >>> !ls test_overwrite
> part-0.parquet
> # write another time to the same directory with a different name: still passes
> >>> ds.write_dataset(table, "test_overwrite", format="parquet",
> ...                  basename_template="data-{i}.parquet")
> >>> !ls test_overwrite
> data-0.parquet part-0.parquet
> # now writing again finally raises an error
> >>> ds.write_dataset(table, "test_overwrite", format="parquet")
> ...
> ArrowInvalid: Could not write to test_overwrite as the directory is not empty
> and existing_data_behavior is to error
> {code}
> So when checking for existing data, it seems to ignore any files that match
> the basename template pattern.
> cc [~westonpace] do you know if this was intentional? (I would find that a
> strange corner case, and in any case it is also not documented)
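For comparison, the behaviour the documentation describes ("raise if any data exists in the destination") could be sketched roughly as below. This is only an illustration on a local filesystem using the standard library. The names `ensure_empty_destination` and `write_part` are hypothetical, not pyarrow APIs, and this is not a description of how the C++ dataset writer is implemented.

```python
import os


def ensure_empty_destination(base_dir):
    """Hypothetical check: raise if base_dir already contains any entry.

    No file name is exempt, including names that match the basename template.
    """
    if os.path.isdir(base_dir) and os.listdir(base_dir):
        raise FileExistsError(
            f"Could not write to {base_dir} as the directory is not empty"
        )


def write_part(base_dir, basename_template="part-{i}.parquet"):
    """Hypothetical stand-in for a dataset write: check first, then create one file."""
    ensure_empty_destination(base_dir)
    os.makedirs(base_dir, exist_ok=True)
    path = os.path.join(base_dir, basename_template.format(i=0))
    with open(path, "wb") as f:
        f.write(b"")  # placeholder bytes, not real Parquet data
    return path
```

With a check like this, the second and third writes in the example above would both raise instead of passing silently, since the directory already holds `part-0.parquet`.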