[
https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320103#comment-17320103
]
Joris Van den Bossche commented on ARROW-10695:
-----------------------------------------------
For me it's clear that you can get collisions because we use identical names
when writing. Even aside from the "multiple machines writing in parallel at the same
time", you can also "reproduce" it by simply writing twice to the same
directory. Dummy example:
{code:python}
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table1 = pa.table({'a': [1, 2, 3]})
>>> ds.write_dataset(table1, "test_dataset_overwrite", format="parquet")
>>> table2 = pa.table({'a': [4, 5, 6]})
>>> ds.write_dataset(table2, "test_dataset_overwrite", format="parquet")
>>> ds.dataset("test_dataset_overwrite/",
...            format="parquet").to_table().to_pandas()
   a
0  4
1  5
2  6
{code}
With the current API, for the above dummy example it's of course obvious that
the second write overwrote the first. But in the end it's the same issue as what
[~ldacey] tries to explain with the multiple-machines-writing-in-parallel case.
If you want to avoid overwriting existing files like this, you need to ensure
that the writer uses unique file names, instead of the default, fixed
{{part-\{0\}.ext}}. The eventual question is then:
- Do we leave this as the responsibility of the user (as [~ldacey] did with eg
{{basename_template = guid() + "-\{i\}.parquet"}})?
- Or do we want to provide a bit of convenience, so that you can obtain
unique file names by allowing a {{\{uuid\}}} placeholder in the template string?
Since the first option is also quite straightforward, it's maybe good enough to
leave this as the responsibility of the user (and we could add an example of
doing this with Python in the docs, along the lines of the sketch below).
Unless we would rather make including a uuid in the basename template the
default (as I think eg Spark does).
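As a minimal sketch of what such a docs example could look like (nothing here
beyond the existing {{basename_template}} keyword of {{write_dataset}}; the
unique prefix simply comes from the stdlib {{uuid}} module, and the directory
name is arbitrary):
{code:python}
import uuid

import pyarrow as pa
import pyarrow.dataset as ds

# A unique prefix per writer process, so repeated or concurrent writes to the
# same directory don't all produce the default "part-{i}" file names.
basename_template = "part-" + uuid.uuid4().hex + "-{i}.parquet"

table = pa.table({'a': [1, 2, 3]})
ds.write_dataset(table, "test_dataset_append", format="parquet",
                 basename_template=basename_template)
{code}
Each write then produces its own distinctly named part files instead of
reusing the same {{part-0.parquet}}.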
The topic here is also closely related to the issue of "overwriting" vs
"appending" a dataset (eg ARROW-7706, although that's for the legacy
ParquetDataset implementation).
> [C++][Dataset] Allow to use a UUID in the basename_template when writing a
> dataset
> ----------------------------------------------------------------------------------
>
> Key: ARROW-10695
> URL: https://issues.apache.org/jira/browse/ARROW-10695
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Minor
> Labels: dataset, dataset-parquet-write
> Fix For: 5.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can
> include a {{"\{i\}"}} part to replace it with an automatically incremented
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the
> file is unique in general (not only for a single write) and to mimic the
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"\{uuid\}"}} in the template string, and
> if present replace it for each file with a new UUID.
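To illustrate the per-file substitution semantics the description above asks
for, here is a rough Python sketch (the function name is hypothetical and the
actual change would live in the C++ dataset writer; this is not Arrow's
implementation):
{code:python}
import uuid

def expand_basename_template(template: str, i: int) -> str:
    # Illustrative only: expand "{i}" as the writer already does, plus the
    # proposed "{uuid}" placeholder.
    name = template.replace("{i}", str(i))
    if "{uuid}" in name:
        # A fresh UUID for every generated file, making names globally unique.
        name = name.replace("{uuid}", uuid.uuid4().hex)
    return name

# expand_basename_template("part-{uuid}-{i}.parquet", 0)
# -> e.g. "part-3f9c1e...-0.parquet"
{code}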