Stefan Lange created ARROW-10611:
------------------------------------

             Summary: Deletion of existing file when write_table fails
                 Key: ARROW-10611
                 URL: https://issues.apache.org/jira/browse/ARROW-10611
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 2.0.0, 1.0.1
         Environment: Docker container with Jupyter notebook; RHEL 7.6; POSIX compliant filesystem
            Reporter: Stefan Lange
When write_table encounters a problem, the exception handler removes the attempted output Parquet file ([arrow/python/pyarrow/parquet.py, line 1737 in commit 47f2e0c|https://github.com/apache/arrow/blob/47f2e0cb03ed8ad265e0688ada8162bf46066483/python/pyarrow/parquet.py#L1737]):

{code:python}
except Exception:
    if _is_path_like(where):
        try:
            os.remove(_stringify_path(where))
        except os.error:
            pass
{code}

This logic makes sense in order to ensure that no file with inconsistent content/state remains. However, if a file with the same name already exists, it also gets deleted.

Would it make sense to add an option that lets the user choose the behaviour in such a case, e.g. to keep an existing file and only overwrite it if the write succeeds? And/or: would it make sense to check early whether the intended file can be written, and fail early if it cannot (without deleting a preexisting file)? For example, if the directory has permission 755 and the already existing file has permission 444, the write attempt fails with a PermissionError, but the exception handler still deletes the preexisting file. This behaviour seems a bit counterintuitive. Or would you say the responsibility lies with the people setting the file/directory permissions correctly?

As [https://github.com/chr1st1ank] added in [https://github.com/apache/arrow/issues/8607]:

This can be reproduced with the following commands in IPython (Pandas uses pyarrow to read/write Parquet). In effect, an attempt to write to a file without write permission on it results in the deletion of that file (provided, of course, that the user has sufficient permissions on the directory to delete the file).
{code}
>> import pandas as pd
>>
>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>> df.to_parquet('df.parquet.gzip', compression='gzip')
>> pd.read_parquet('df.parquet.gzip')
>> !ls -l 'df.parquet.gzip'
-rw-r--r-- 1 myuser domain users 1529 Nov 13 09:31 df.parquet.gzip
>> !chmod 000 'df.parquet.gzip'
>> df.to_parquet('df.parquet.gzip', compression='gzip')
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-10-584c5c8752e0> in <module>
----> 1 df.to_parquet('df.parquet.gzip', compression='gzip')

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2220             index=index,
   2221             partition_cols=partition_cols,
-> 2222             **kwargs
   2223         )
   2224

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    252         index=index,
    253         partition_cols=partition_cols,
--> 254         **kwargs
    255     )
    256

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
    115             compression=compression,
    116             coerce_timestamps=coerce_timestamps,
--> 117             **kwargs
    118         )
    119

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, flavor, **kwargs)
   1122                 compression=compression,
   1123                 use_deprecated_int96_timestamps=use_int96,
-> 1124                 **kwargs) as writer:
   1125             writer.write_table(table, row_group_size=row_group_size)
   1126     except Exception:

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, where, schema, flavor, version, use_dictionary, compression, use_deprecated_int96_timestamps, **options)
    338         if _is_path_like(where):
    339             fs = _get_fs_from_path(where)
--> 340             sink = self.file_handle = fs.open(where, 'wb')
    341         else:
    342             sink = where

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/filesystem.py in open(self, path, mode)
    243         """
    244         path = _stringify_path(path)
--> 245         return open(path, mode=mode)
    246
    247     @property

PermissionError: [Errno 13] Permission denied: 'df.parquet.gzip'

>> !ls -l 'df.parquet.gzip'
ls: cannot access df.parquet.gzip: No such file or directory
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
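The suggested "keep an existing file and only overwrite it on success" behaviour can be emulated on the caller's side by writing to a temporary file and renaming it into place only after the write succeeds. Below is a minimal sketch of that pattern; {{safe_write}} and {{failing_write}} are hypothetical helper names, and {{write_fn}} stands in for whatever actually produces the file (e.g. a call to {{pyarrow.parquet.write_table}}):

{code:python}
import os
import tempfile

def safe_write(path, write_fn):
    """Write via write_fn(tmp_path), then move the result into place.

    A preexisting file at `path` is replaced only once the write has
    fully succeeded; on failure it is left untouched. (Hypothetical
    helper, not part of the pyarrow API.)
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so os.replace()
    # stays on one filesystem and is atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix='.tmp')
    os.close(fd)
    try:
        write_fn(tmp_path)
        # os.replace() depends on directory permissions, not on the
        # target file's mode, so it also works over a read-only file.
        os.replace(tmp_path, path)
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)

# Demo: a failed write leaves the preexisting file intact.
with open('out.bin', 'w') as f:
    f.write('old data')

def failing_write(tmp_path):
    raise PermissionError('simulated write failure')

try:
    safe_write('out.bin', failing_write)
except PermissionError:
    pass

print(open('out.bin').read())  # prints "old data"
{code}

Something along these lines inside write_table's exception handling would avoid the data loss described above, at the cost of briefly needing extra space for the temporary file.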