[
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476120#comment-17476120
]
Lance Dacey commented on ARROW-12358:
-------------------------------------
Ah, so it must be related to the filesystem. I am using adlfs / fsspec to save
datasets on Azure Blob:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# fs is an adlfs.AzureBlobFileSystem instance created earlier (see the printed type below)
print(type(fs))

tab = pa.Table.from_pydict({'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3]})
ds.write_dataset(data=tab,
                 base_dir='/dev/newdataset',
                 partitioning_flavor='hive',
                 partitioning=['part'],
                 existing_data_behavior='delete_matching',
                 format='parquet',
                 filesystem=fs)
{code}
Output:
{code:python}
<class 'adlfs.spec.AzureBlobFileSystem'>
[2022-01-14 12:45:44,076] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,090] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,093] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,109] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,121] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,124] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
/tmp/ipykernel_47/3075266795.py in <module>
      4 print(type(fs))
      5 tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3] })
----> 6 ds.write_dataset(data=tab,
      7                  base_dir='/dev/newdataset',
      8                  partitioning_flavor='hive',

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, partitioning_flavor, schema, filesystem, file_options, use_threads, max_partitions, file_visitor, existing_data_behavior)
    876         scanner = data
    877 
--> 878     _filesystemdataset_write(
    879         scanner, base_dir, basename_template, filesystem, partitioning,
    880         file_options, max_partitions, file_visitor, existing_data_behavior

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write()

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_delete_dir_contents()

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in delete_dir_contents(self, path)
    357             raise ValueError(
    358                 "delete_dir_contents called on path '", path, "'")
--> 359         self._delete_dir_contents(path)
    360 
    361     def delete_root_dir_contents(self):

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in _delete_dir_contents(self, path)
    347 
    348     def _delete_dir_contents(self, path):
--> 349         for subpath in self.fs.listdir(path, detail=False):
    350             if self.fs.isdir(subpath):
    351                 self.fs.rm(subpath, recursive=True)

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/spec.py in listdir(self, path, detail, **kwargs)
   1221     def listdir(self, path, detail=True, **kwargs):
   1222         """Alias of `AbstractFileSystem.ls`."""
-> 1223         return self.ls(path, detail=detail, **kwargs)
   1224 
   1225     def cp(self, path1, path2, **kwargs):

/opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in ls(self, path, detail, invalidate_cache, delimiter, return_glob, **kwargs)
    753         ):
    754 
--> 755             files = sync(
    756                 self.loop,
    757                 self._ls,

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs)
     69         raise FSTimeoutError from return_result
     70     elif isinstance(return_result, BaseException):
---> 71         raise return_result
     72     else:
     73         return return_result

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout)
     23         coro = asyncio.wait_for(coro, timeout=timeout)
     24     try:
---> 25         result[0] = await coro
     26     except Exception as ex:
     27         result[0] = ex

/opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in _ls(self, path, invalidate_cache, delimiter, return_glob, **kwargs)
    875         if not finalblobs:
    876             if not await self._exists(target_path):
--> 877                 raise FileNotFoundError
    878             return []
    879         cache[target_path] = finalblobs

FileNotFoundError: 
{code}
Do you think I should raise this as an issue on the adlfs project instead?
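For reference, the adlfs frame at the bottom of that traceback raises {{FileNotFoundError}} whenever it lists a path that does not exist yet, so a rough, untested workaround sketch (reusing {{tab}} and {{fs}} from above) is to only ask for {{delete_matching}} once the target directory exists:
{code:python}
# Workaround sketch only: skip 'delete_matching' on the very first write,
# when there is nothing to delete and listing the path raises on adlfs.
# `tab` and `fs` are the table and AzureBlobFileSystem from the snippet above.
behavior = 'delete_matching' if fs.exists('/dev/newdataset') else 'overwrite_or_ignore'
ds.write_dataset(data=tab,
                 base_dir='/dev/newdataset',
                 partitioning_flavor='hive',
                 partitioning=['part'],
                 existing_data_behavior=behavior,
                 format='parquet',
                 filesystem=fs)
{code}
This is only a guess at the trigger; if the listing happens per partition directory, brand-new partitions could still hit the same error even when the base directory exists.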
> [C++][Python][R][Dataset] Control overwriting vs appending when writing to
> existing dataset
> -------------------------------------------------------------------------------------------
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
> Fix For: 8.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}})
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when
> you are writing to an existing dataset, you de facto overwrite previous data
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by
> ensuring the file names are unique (the user can specify the
> {{basename_template}} to be something unique). There is also ARROW-7706 about
> silently doubling data (so _not_ overwriting existing data) with the legacy
> {{parquet.write_to_dataset}} implementation.
> It could be good to have a "mode" when writing datasets that controls the
> different possible behaviours. Erroring when there is pre-existing data in
> the target directory is maybe the safest default, because both silently
> appending and silently overwriting can be surprising behaviour depending on
> your expectations.
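As an illustration of the {{basename_template}} workaround mentioned in the description (my own sketch, not part of the issue text), a per-write unique prefix makes repeated writes add new files instead of replacing {{part-0.parquet}}:
{code:python}
import uuid

import pyarrow.dataset as ds

# Sketch only: a unique prefix plus the required '{i}' placeholder, so a
# second write_dataset() call lands next to the earlier files rather than
# overwriting them. `tab` and `fs` are assumed from the comment above.
ds.write_dataset(data=tab,
                 base_dir='/dev/newdataset',
                 basename_template=f"{uuid.uuid4().hex}-{{i}}.parquet",
                 format='parquet',
                 filesystem=fs)
{code}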
--
This message was sent by Atlassian Jira
(v8.20.1#820001)