[
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476405#comment-17476405
]
Weston Pace commented on ARROW-12358:
-------------------------------------
Ah, I think I see. We call something like...
{code}
fs.CreateDir(partition_dir);
if (delete_matching) {
  fs.DeleteDirContents(partition_dir);
}
{code}
My guess is that ADLFS doesn't handle empty directories very well (I think we
have to create an empty file or something when working with S3), so the
fs.CreateDir call is basically a no-op. Then, when we call
DeleteDirContents, it cannot find the directory.
This is a bit of a tricky one. I wonder if we can come up with some kind of
workaround.
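To make the suspected failure mode concrete, here is a minimal sketch in plain Python. It does not use Arrow or ADLFS: {{ToyObjectStore}}, {{prepare_partition}} and the {{tolerate_missing}} flag are all hypothetical names, and the "no-op create_dir" behaviour is only the guess described above, not confirmed ADLFS behaviour. One possible workaround is to treat "directory not found" as "nothing to delete".

```python
class ToyObjectStore:
    """Hypothetical stand-in for a flat object store: only file keys exist."""

    def __init__(self):
        self.files = {}  # maps "dir/name" keys to bytes

    def create_dir(self, path):
        # Assumed behaviour: an empty "directory" has no key to store,
        # so creating one changes nothing.
        pass

    def write(self, key, data):
        self.files[key] = data

    def dir_exists(self, path):
        # A directory only "exists" if some key lives under its prefix.
        prefix = path.rstrip("/") + "/"
        return any(key.startswith(prefix) for key in self.files)

    def delete_dir_contents(self, path):
        if not self.dir_exists(path):
            raise FileNotFoundError(path)
        prefix = path.rstrip("/") + "/"
        for key in [k for k in self.files if k.startswith(prefix)]:
            del self.files[key]


def prepare_partition(fs, partition_dir, delete_matching, tolerate_missing):
    """Mirror the create-then-delete sequence from the snippet above."""
    fs.create_dir(partition_dir)
    if delete_matching:
        try:
            fs.delete_dir_contents(partition_dir)
        except FileNotFoundError:
            if not tolerate_missing:
                raise
            # Possible workaround: a missing directory has no contents,
            # so there is nothing to delete.
```

With {{tolerate_missing=False}} the first write into a fresh partition raises {{FileNotFoundError}}, which matches the reported symptom; with {{tolerate_missing=True}} it succeeds, and existing files in the partition are still removed.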
> [C++][Python][R][Dataset] Control overwriting vs appending when writing to
> existing dataset
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
> Fix For: 8.0.0
>
>
> Currently, dataset writing (e.g. with {{pyarrow.dataset.write_dataset}})
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when
> you write to an existing dataset, you de facto overwrite previous data
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by
> ensuring the file names are unique (the user can specify the
> {{basename_template}} to be something unique). There is also ARROW-7706 about
> silently doubling data (so _not_ overwriting existing data) with the legacy
> {{parquet.write_to_dataset}} implementation.
> It could be good to have a "mode" when writing datasets that controls the
> different possible behaviours. Erroring when there is pre-existing data
> in the target directory is maybe the safest default, because both appending
> and overwriting silently can be surprising behaviour depending on your
> expectations.
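The proposed "mode" could look roughly like the following plain-Python sketch on a local directory. The function {{write_part}}, its parameters and the mode names ({{"error"}}, {{"overwrite"}}, {{"append"}}) are hypothetical illustrations of the three behaviours described above, not Arrow's API.

```python
from pathlib import Path


def write_part(target_dir, data, mode="error", template="part{i}.ext"):
    """Write one file into target_dir under a hypothetical write mode.

    "error":     refuse to write if target_dir already contains anything
    "overwrite": reuse the fixed name, replacing a same-named file
    "append":    pick the first unused index so nothing is replaced
    """
    if mode not in ("error", "overwrite", "append"):
        raise ValueError(f"unknown mode: {mode!r}")

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    if mode == "error" and any(target.iterdir()):
        raise FileExistsError(f"{target} already contains data")

    i = 0
    if mode == "append":
        # Skip indices that are already taken by existing files.
        while (target / template.format(i=i)).exists():
            i += 1

    path = target / template.format(i=i)
    path.write_bytes(data)
    return path
```

Defaulting to {{"error"}} means neither silent overwriting nor silent doubling of data can happen unless the caller asks for it explicitly.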