Stefan Lange created ARROW-10611:
------------------------------------
Summary: Deletion of existing file when write_table fails
Key: ARROW-10611
URL: https://issues.apache.org/jira/browse/ARROW-10611
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 2.0.0, 1.0.1
Environment: Docker container with Jupyter notebook; RHEL 7.6; POSIX
compliant filesystem
Reporter: Stefan Lange
[python/pyarrow/parquet.py, line 1737 at commit 47f2e0c|https://github.com/apache/arrow/blob/47f2e0cb03ed8ad265e0688ada8162bf46066483/python/pyarrow/parquet.py#L1737]:
{{os.remove(_stringify_path(where))}}
When write_table encounters a problem, the exception handler removes the
attempted output parquet file (see snippet below).
This logic makes sense in order to make sure no file with inconsistent
content/state remains.
However, if a file with the same name already existed, it is deleted as well.
Would it make sense to add an option letting the user choose the behaviour in
such a case, e.g. to keep an existing file and only overwrite it if the write
succeeds?
And/or: would it make sense to check early whether the intended file can be
written, and to fail early (without deleting a preexisting file) if it cannot?
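Such a pre-flight check could be sketched with the standard library alone; the helper name {{can_write}} is hypothetical and not part of the pyarrow API:

```python
import os

def can_write(where):
    """Hypothetical pre-flight check: return False early, without touching
    any preexisting file, if `where` cannot be (over)written.
    Note: os.access() checks the real uid, so results can differ for root
    or under ACLs / network filesystems."""
    if os.path.exists(where):
        # overwriting: need write permission on the file itself
        return os.access(where, os.W_OK)
    # creating: need write permission on the containing directory
    parent = os.path.dirname(os.path.abspath(where))
    return os.path.isdir(parent) and os.access(parent, os.W_OK)
```

A writer could call this before opening the sink and raise a PermissionError immediately, leaving any existing file untouched.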
E.g. if the directory has permission 755 and the already existing file has
permission 444, the write attempt fails with a PermissionError, but the
exception handler then deletes the preexisting file. This behaviour seems
counterintuitive.
Or would you say the responsibility lies with whoever sets the file/directory
permissions?
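The "keep an existing file and only replace it on success" behaviour could be implemented with the usual write-to-temp-then-rename pattern. A minimal stdlib sketch, where {{write_func}} is an illustrative callback standing in for the actual parquet writing (not pyarrow API):

```python
import os
import tempfile

def write_atomically(write_func, where):
    """Write via a temporary file in the same directory, then atomically
    replace the target. A preexisting file at `where` is only touched if
    the write succeeds."""
    dirname = os.path.dirname(os.path.abspath(where))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix='.tmp')
    os.close(fd)
    try:
        write_func(tmp_path)
        os.replace(tmp_path, where)  # atomic on POSIX filesystems
    except Exception:
        os.remove(tmp_path)  # a preexisting file at `where` survives
        raise
```

Creating the temporary file in the target's own directory (rather than /tmp) keeps the final {{os.replace}} on one filesystem, which is what makes it atomic.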
{code:python}
except Exception:
    if _is_path_like(where):
        try:
            os.remove(_stringify_path(where))
        except os.error:
            pass
{code}
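A less invasive variant of this handler would remember whether the target existed before the write and only delete files that this call itself created. A stdlib sketch, with {{write_func}} as a hypothetical stand-in for the actual writing (names are illustrative, not pyarrow API):

```python
import os

def write_table_guarded(write_func, where):
    """Sketch of the error path with the proposed guard: on failure,
    only remove the output file if this write attempt created it, so a
    preexisting file is never deleted."""
    existed_before = os.path.exists(where)
    try:
        write_func(where)
    except Exception:
        if not existed_before:
            try:
                os.remove(where)
            except OSError:
                pass
        raise
```

Note that a failed write that got far enough to open the file in 'wb' mode would still have truncated a preexisting file's contents; only a temp-file-and-rename scheme avoids that entirely.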
As [https://github.com/chr1st1ank] added in
[https://github.com/apache/arrow/issues/8607]:
This can be reproduced with the following commands in IPython (pandas uses
pyarrow to read/write Parquet).
In effect, attempting to write to a file without write permission on it results
in the deletion of that file (provided, of course, that the user has sufficient
permissions on the directory to delete it).
{code:python}
>>> import pandas as pd
>>>
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip', compression='gzip')
>>> pd.read_parquet('df.parquet.gzip')
>>> !ls -l 'df.parquet.gzip'
-rw-r--r-- 1 myuser domain users 1529 Nov 13 09:31 df.parquet.gzip
>>> !chmod 000 'df.parquet.gzip'
>>> df.to_parquet('df.parquet.gzip', compression='gzip')
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-10-584c5c8752e0> in <module>
----> 1 df.to_parquet('df.parquet.gzip', compression='gzip')

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2220             index=index,
   2221             partition_cols=partition_cols,
-> 2222             **kwargs
   2223         )
   2224

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    252         index=index,
    253         partition_cols=partition_cols,
--> 254         **kwargs
    255     )
    256

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
    115             compression=compression,
    116             coerce_timestamps=coerce_timestamps,
--> 117             **kwargs
    118         )
    119

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, flavor, **kwargs)
   1122                     compression=compression,
   1123                     use_deprecated_int96_timestamps=use_int96,
-> 1124                     **kwargs) as writer:
   1125             writer.write_table(table, row_group_size=row_group_size)
   1126     except Exception:

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, where, schema, flavor, version, use_dictionary, compression, use_deprecated_int96_timestamps, **options)
    338         if _is_path_like(where):
    339             fs = _get_fs_from_path(where)
--> 340             sink = self.file_handle = fs.open(where, 'wb')
    341         else:
    342             sink = where

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/filesystem.py in open(self, path, mode)
    243         """
    244         path = _stringify_path(path)
--> 245         return open(path, mode=mode)
    246
    247     @property

PermissionError: [Errno 13] Permission denied: 'df.parquet.gzip'
>>> !ls -l 'df.parquet.gzip'
ls: cannot access df.parquet.gzip: No such file or directory
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)