[ https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286036#comment-17286036 ]

Lance Dacey commented on ARROW-10695:
-------------------------------------

Perhaps this has changed, but I was running into issues when writing to a 
dataset in parallel. 

For example, I use Airflow to extract data from 6 different servers in 
parallel (a separate task downloads the data from each source, e.g. 
"extract_cms_1", "extract_cms_2") using turbodbc, which fetches the data as 
pyarrow tables; each task then writes its table to Azure Blob using 
ds.write_dataset().
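
For reference, the write call in each task looks roughly like this (a minimal 
sketch: the table, local base_dir and partition column are made up, and in the 
real pipeline the target is an Azure Blob container reached via adlfs/fsspec 
rather than local disk):

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Stand-in for the table returned by the turbodbc fetch.
    table = pa.table({"account": ["cms_1", "cms_1", "cms_2"],
                      "value": [1, 2, 3]})

    # Each Airflow task runs a call like this concurrently. With the default
    # basename_template ("part-{i}.parquet"), every task starts counting from
    # part-0 inside the same partition directory.
    ds.write_dataset(
        table,
        base_dir="dataset_a",
        format="parquet",
        partitioning=ds.partitioning(
            pa.schema([("account", pa.string())]), flavor="hive"),
    )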

I noticed that the part-{i} names were clashing when this happened. part-0 
would be overwritten several times, for example, and the behaviour seemed 
random, which hinted at a race condition. I have another Airflow DAG that 
downloads from 74 different REST APIs as well (the downloads can happen 
simultaneously, but the source and credentials used differ per account).

Adding a guid() to the filenames solved that issue for me. 
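
Concretely, something like this works today by generating the uuid on the 
caller side (again just a sketch with made-up names; the literal "{i}" is kept 
in the template so pyarrow can still number multiple files within a single 
write):

    import uuid
    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"account": ["cms_1"], "value": [1]})

    # The uuid makes the basename unique across concurrent writers, while
    # pyarrow fills in the remaining "{i}" placeholder per generated file.
    ds.write_dataset(
        table,
        base_dir="dataset_a",
        format="parquet",
        basename_template=f"part-{uuid.uuid4().hex}-{{i}}.parquet",
    )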

Is there a separate issue open for partition_filename_cb to be added to 
ds.write_dataset()? I have been using that feature to "repartition" Dataset A 
with many small files into Dataset B with one file per partition (a larger 
physical file, fewer fragments).
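
For now that repartitioning goes through the legacy writer, roughly like this 
(the paths and the partition column are invented for the example):

    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Read Dataset A (many small files) back into a single table.
    table = ds.dataset("dataset_a", format="parquet",
                       partitioning="hive").to_table()

    # Rewrite it with the legacy writer, whose partition_filename_cb lets
    # each partition collapse into exactly one predictably named file.
    pq.write_to_dataset(
        table,
        root_path="dataset_b",
        partition_cols=["date"],
        partition_filename_cb=lambda keys: "-".join(map(str, keys)) + ".parquet",
    )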

> [C++][Dataset] Allow to use a UUID in the basename_template when writing a 
> dataset
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-10695
>                 URL: https://issues.apache.org/jira/browse/ARROW-10695
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset, dataset-parquet-write
>             Fix For: 4.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can 
> include a {{"{i}"}} part to replace it with an automatically incremented 
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the 
> file is unique in general (not only for a single write) and to mimic the 
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"{uuid}"}} in the template string, and 
> if present replace it for each file with a new UUID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
