[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320221#comment-17320221
 ] 

Lance Dacey commented on ARROW-12358:
-------------------------------------

I think that having an "overwrite" option would satisfy my need for 
partition_filename_cb (https://issues.apache.org/jira/browse/ARROW-12365), 
provided we can replace _all_ of the data inside the partition. This would 
also be great for file compaction - we could read a dataset made up of many 
tiny file fragments and then overwrite it with fewer, larger files.

Overwriting a specific file is also useful. For example, my basename_template 
is usually f"{task-id}-{schedule-timestamp}-{file-count}-{i}.parquet". I am 
able to clear a task and overwrite a file which already exists. The only flaw 
here is that we cannot control the {i} variable, so the resulting file name 
is not guaranteed to match. I could live without this.

For "append", would it be possible for the counter to be per partition 
instead? (There could be race conditions if multiple tasks write to the same 
partition in parallel, and computing a global fragment count seems to be a 
more demanding step for large datasets.) Or could the {i} variable optionally 
be a uuid instead of the fragment count?

"error" makes sense. 

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12358
>                 URL: https://issues.apache.org/jira/browse/ARROW-12358
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>             Fix For: 5.0.0
>
>
> Currently, the dataset writing (e.g. with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part{i}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. Erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both silently 
> appending and silently overwriting can be surprising behaviour depending 
> on your expectations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
