[
https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320103#comment-17320103
]
Joris Van den Bossche commented on ARROW-10695:
-----------------------------------------------
For me it's clear that you can get collisions because we use identical names
when writing. Even aside from the "multiple machines writing in parallel at the same
time", you can also "reproduce" it by simply writing twice to the same
directory. Dummy example:
{code:python}
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table1 = pa.table({'a': [1, 2, 3]})
>>> ds.write_dataset(table1, "test_dataset_overwrite", format="parquet")
>>> table2 = pa.table({'a': [4, 5, 6]})
>>> ds.write_dataset(table2, "test_dataset_overwrite", format="parquet")
>>> ds.dataset("test_dataset_overwrite/",
...            format="parquet").to_table().to_pandas()
   a
0  4
1  5
2  6
{code}
With the current API, for the above dummy example it's of course obvious that
the second write overwrote the first. But in the end it's the same issue as what
[~ldacey] tries to explain with the multiple-machines-writing-in-parallel case.
If you want to avoid overwriting existing files like this, you need to ensure
that the writer uses unique file names, instead of the default, fixed
{{part-\{0\}.ext}}. The eventual question is then:
- Do we leave this as the responsibility of the user (as [~ldacey] did with eg
{{basename_template = guid() + "-\{i\}.parquet"}})?
- Or do we want to provide a bit of convenience, so that you can obtain
unique file names by allowing a {{\{uuid\}}} placeholder in the template string?
Since the first option is also quite straightforward, it's maybe good enough to
leave this as the responsibility of the user (and we could add an example of
doing this with Python in the docs, along the lines of the sketch below).
Unless we would rather make including a uuid in the basename template the
default (as I think eg Spark does).
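As a minimal sketch of what such a docs example could look like (nothing here
beyond the existing {{basename_template}} keyword of {{write_dataset}}; the
unique prefix simply comes from the stdlib {{uuid}} module, and the directory
name is arbitrary):
{code:python}
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

# A unique prefix per writer process, so repeated or concurrent writes to the
# same directory don't all produce the default "part-{i}" file names.
basename_template = "part-" + uuid.uuid4().hex + "-{i}.parquet"

table = pa.table({'a': [1, 2, 3]})
ds.write_dataset(table, "test_dataset_append", format="parquet",
                 basename_template=basename_template)
{code}
Each write then produces its own distinctly named part files instead of
reusing the same {{part-0.parquet}}.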
The topic here is also closely related to the issue of "overwriting" vs
"appending" a dataset (eg ARROW-7706, although that's for the legacy
ParquetDataset implementation).
> [C++][Dataset] Allow to use a UUID in the basename_template when writing a
> dataset
> ----------------------------------------------------------------------------------
>
> Key: ARROW-10695
> URL: https://issues.apache.org/jira/browse/ARROW-10695
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Minor
> Labels: dataset, dataset-parquet-write
> Fix For: 5.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can
> include a {{"\{i\}"}} part to replace it with an automatically incremented
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the
> file is unique in general (not only for a single write) and to mimic the
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"\{uuid\}"}} in the template string, and
> if present replace it for each file with a new UUID.
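To illustrate the per-file substitution semantics the description above asks
for, here is a rough Python sketch (the function name is hypothetical and the
actual change would live in the C++ dataset writer; this is not Arrow's
implementation):
{code:python}
import uuid

def expand_basename_template(template: str, i: int) -> str:
    # Illustrative only: expand "{i}" as the writer already does, plus the
    # proposed "{uuid}" placeholder.
    name = template.replace("{i}", str(i))
    if "{uuid}" in name:
        # A fresh UUID for every generated file, making names globally unique.
        name = name.replace("{uuid}", uuid.uuid4().hex)
    return name

# expand_basename_template("part-{uuid}-{i}.parquet", 0)
# -> e.g. "part-3f9c1e...-0.parquet"
{code}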