[
https://issues.apache.org/jira/browse/ARROW-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-10611:
---------------------------------
Summary: [Python] Deletion of existing file when write_table fails (was:
Deletion of existing file when write_table fails)
> [Python] Deletion of existing file when write_table fails
> ---------------------------------------------------------
>
> Key: ARROW-10611
> URL: https://issues.apache.org/jira/browse/ARROW-10611
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 1.0.1, 2.0.0
> Environment: Docker container with Jupyter notebook; RHEL 7.6; POSIX
> compliant filesystem
> Reporter: Stefan Lange
> Priority: Major
>
> [python/pyarrow/parquet.py, line 1737 at commit 47f2e0c|https://github.com/apache/arrow/blob/47f2e0cb03ed8ad265e0688ada8162bf46066483/python/pyarrow/parquet.py#L1737]:
> {{os.remove(_stringify_path(where))}}
>
> When write_table encounters a problem, the exception handler removes the
> attempted output Parquet file (see snippet below).
> This logic makes sense: it ensures that no file with inconsistent
> content/state remains.
> However, if a file with the same name already exists, it is also deleted.
> Would it make sense to add an option that lets the user choose the behaviour
> in such a case, e.g. to keep an existing file and only overwrite it if the
> write succeeds?
> And/or: would it make sense to check early whether the intended file can be
> written, and fail early if it cannot (without deleting a preexisting file)?
> For example, if the directory has permission 755 and the already existing
> file has permission 444, the write attempt fails with a PermissionError, but
> the exception handler still deletes the preexisting file. This behaviour is
> somewhat counterintuitive.
> Or would you say the responsibility lies with the people who set the
> file/directory permissions?
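One way to realize the "keep the existing file unless the write succeeds" option would be to write to a temporary file in the same directory and atomically rename it over the target only on success. A minimal sketch of that idea (safe_write is a hypothetical helper, not part of pyarrow):

```python
import os
import tempfile

def safe_write(path, data):
    """Write bytes to path so that an existing file at path survives a
    failed write. (Hypothetical sketch, not pyarrow's actual code.)"""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so the final
    # rename stays on the same filesystem and is atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomically swap in the new file
    except Exception:
        os.remove(tmp_path)  # clean up only the temporary file
        raise
```

With this scheme a failure at any point before the final rename leaves a preexisting file untouched.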
>
> {code:python}
> except Exception:
>     if _is_path_like(where):
>         try:
>             os.remove(_stringify_path(where))
>         except os.error:
>             pass
> {code}
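A smaller change along the lines suggested above would be to record whether the file existed before the write, and have the handler delete it only if the writer itself created it. A sketch under that assumption (write_with_guarded_cleanup is hypothetical; note that a pre-existing file can still be left truncated by a failed open-and-write, which is why a write-to-temporary-then-rename scheme is more robust):

```python
import os

def write_with_guarded_cleanup(path, data):
    """On failure, delete the output only if it did not exist beforehand.
    (Hypothetical sketch, not pyarrow's actual code.)"""
    existed_before = os.path.exists(path)
    try:
        with open(path, "wb") as f:
            f.write(data)
    except Exception:
        if not existed_before:
            # We created this file; remove the partial output.
            try:
                os.remove(path)
            except OSError:
                pass
        # A pre-existing file is left in place (possibly truncated).
        raise
```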
> As [https://github.com/chr1st1ank] noted in
> [https://github.com/apache/arrow/issues/8607]:
> This can be reproduced with the following commands in IPython (pandas uses
> pyarrow to read and write Parquet).
> In effect, an attempt to write to a file without write permission on it
> results in the deletion of that file (provided, of course, that the user
> has sufficient permissions on the directory to delete the file).
> >> import pandas as pd
> >>
> >> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
> >> df.to_parquet('df.parquet.gzip', compression='gzip')
> >> pd.read_parquet('df.parquet.gzip')
> >> !ls -l 'df.parquet.gzip'
> -rw-r--r-- 1 myuser domain users 1529 Nov 13 09:31 df.parquet.gzip
> >> !chmod 000 'df.parquet.gzip'
> >> df.to_parquet('df.parquet.gzip', compression='gzip')
> ---------------------------------------------------------------------------
>
> PermissionError Traceback (most recent call last)
>
> <ipython-input-10-584c5c8752e0> in <module>
> ----> 1 df.to_parquet('df.parquet.gzip', compression='gzip')
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/core/frame.py in
> to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
> 2220 index=index,
> 2221 partition_cols=partition_cols,
> -> 2222 **kwargs
> 2223 )
> 2224
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in
> to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
> 252 index=index,
> 253 partition_cols=partition_cols,
> --> 254 **kwargs
> 255 )
> 256
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in
> write(self, df, path, compression, coerce_timestamps, index, partition_cols,
> **kwargs)
> 115 compression=compression,
> 116 coerce_timestamps=coerce_timestamps,
> --> 117 **kwargs
> 118 )
> 119
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in
> write_table(table, where, row_group_size, version, use_dictionary,
> compression, use_deprecated_int96_timestamps, coerce_timestamps,
> allow_truncated_timestamps, flavor, **kwargs)
> 1122 compression=compression,
> 1123 use_deprecated_int96_timestamps=use_int96,
> -> 1124 **kwargs) as writer:
> 1125 writer.write_table(table, row_group_size=row_group_size)
> 1126 except Exception:
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in
> __init__(self, where, schema, flavor, version, use_dictionary, compression,
> use_deprecated_int96_timestamps, **options)
> 338 if _is_path_like(where):
> 339 fs = _get_fs_from_path(where)
> --> 340 sink = self.file_handle = fs.open(where, 'wb')
> 341 else:
> 342 sink = where
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/filesystem.py in
> open(self, path, mode)
> 243 """
> 244 path = _stringify_path(path)
> --> 245 return open(path, mode=mode)
> 246
> 247 @property
> PermissionError: [Errno 13] Permission denied: 'df.parquet.gzip'
>
> >> !ls -l 'df.parquet.gzip'
> ls: cannot access df.parquet.gzip: No such file or directory
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)