[
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476120#comment-17476120
]
Lance Dacey commented on ARROW-12358:
-------------------------------------
Ah, so it must be related to the filesystem. I am using adlfs / fsspec to save
datasets on Azure Blob:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# fs is an adlfs.AzureBlobFileSystem instance created earlier (see the printed type below)
print(type(fs))

tab = pa.Table.from_pydict({'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3]})
ds.write_dataset(data=tab,
                 base_dir='/dev/newdataset',
                 partitioning_flavor='hive',
                 partitioning=['part'],
                 existing_data_behavior='delete_matching',
                 format='parquet',
                 filesystem=fs)
{code}
Output:
{code:python}
<class 'adlfs.spec.AzureBlobFileSystem'>
[2022-01-14 12:45:44,076] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,090] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,093] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,109] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,121] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,124] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_47/3075266795.py in <module>
      4 print(type(fs))
      5 tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3] })
----> 6 ds.write_dataset(data=tab,
      7                  base_dir='/dev/newdataset',
      8                  partitioning_flavor='hive',

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, partitioning_flavor, schema, filesystem, file_options, use_threads, max_partitions, file_visitor, existing_data_behavior)
    876         scanner = data
    877 
--> 878     _filesystemdataset_write(
    879         scanner, base_dir, basename_template, filesystem, partitioning,
    880         file_options, max_partitions, file_visitor, existing_data_behavior

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write()

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_delete_dir_contents()

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in delete_dir_contents(self, path)
    357             raise ValueError(
    358                 "delete_dir_contents called on path '", path, "'")
--> 359         self._delete_dir_contents(path)
    360 
    361     def delete_root_dir_contents(self):

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in _delete_dir_contents(self, path)
    347 
    348     def _delete_dir_contents(self, path):
--> 349         for subpath in self.fs.listdir(path, detail=False):
    350             if self.fs.isdir(subpath):
    351                 self.fs.rm(subpath, recursive=True)

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/spec.py in listdir(self, path, detail, **kwargs)
   1221     def listdir(self, path, detail=True, **kwargs):
   1222         """Alias of `AbstractFileSystem.ls`."""
-> 1223         return self.ls(path, detail=detail, **kwargs)
   1224 
   1225     def cp(self, path1, path2, **kwargs):

/opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in ls(self, path, detail, invalidate_cache, delimiter, return_glob, **kwargs)
    753         ):
    754 
--> 755             files = sync(
    756                 self.loop,
    757                 self._ls,

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     69         raise FSTimeoutError from return_result
     70     elif isinstance(return_result, BaseException):
---> 71         raise return_result
     72     else:
     73         return return_result

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     23         coro = asyncio.wait_for(coro, timeout=timeout)
     24     try:
---> 25         result[0] = await coro
     26     except Exception as ex:
     27         result[0] = ex

/opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in _ls(self, path, invalidate_cache, delimiter, return_glob, **kwargs)
    875         if not finalblobs:
    876             if not await self._exists(target_path):
--> 877                 raise FileNotFoundError
    878             return []
    879         cache[target_path] = finalblobs

FileNotFoundError: 
{code}
Do you think I should raise this as an issue on the adlfs project instead?
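For reference, the adlfs frame at the bottom of that traceback raises {{FileNotFoundError}} whenever it lists a path that does not exist yet, so a rough, untested workaround sketch (reusing {{tab}} and {{fs}} from above) is to only ask for {{delete_matching}} once the target directory exists:
{code:python}
# Workaround sketch only: skip 'delete_matching' on the very first write,
# when there is nothing to delete and listing the path raises on adlfs.
# `tab` and `fs` are the table and AzureBlobFileSystem from the snippet above.
behavior = 'delete_matching' if fs.exists('/dev/newdataset') else 'overwrite_or_ignore'
ds.write_dataset(data=tab,
                 base_dir='/dev/newdataset',
                 partitioning_flavor='hive',
                 partitioning=['part'],
                 existing_data_behavior=behavior,
                 format='parquet',
                 filesystem=fs)
{code}
This is only a guess at the trigger; if the listing happens per partition directory, brand-new partitions could still hit the same error even when the base directory exists.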
> [C++][Python][R][Dataset] Control overwriting vs appending when writing to
> existing dataset
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
> Fix For: 8.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}})
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when
> you are writing to an existing dataset, you de facto overwrite previous data
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by
> ensuring the file names are unique (the user can specify the
> {{basename_template}} to be something unique). There is also ARROW-7706 about
> silently doubling data (so _not_ overwriting existing data) with the legacy
> {{parquet.write_to_dataset}} implementation.
> It could be good to have a "mode" when writing datasets that controls the
> different possible behaviours. Erroring when there is pre-existing data in
> the target directory is maybe the safest default, because both silently
> appending and silently overwriting can be surprising behaviour depending on
> your expectations.
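As an illustration of the {{basename_template}} workaround mentioned in the description (my own sketch, not part of the issue text), a per-write unique prefix makes repeated writes add new files instead of replacing {{part-0.parquet}}:
{code:python}
import uuid

import pyarrow.dataset as ds

# Sketch only: a unique prefix plus the required '{i}' placeholder, so a
# second write_dataset() call lands next to the earlier files rather than
# overwriting them. `tab` and `fs` are assumed from the comment above.
ds.write_dataset(data=tab,
                 base_dir='/dev/newdataset',
                 basename_template=f"{uuid.uuid4().hex}-{{i}}.parquet",
                 format='parquet',
                 filesystem=fs)
{code}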
--
This message was sent by Atlassian Jira
(v8.20.1#820001)