[ 
https://issues.apache.org/jira/browse/ARROW-10611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-10611:
---------------------------------
    Summary: [Python] Deletion of existing file when write_table fails  (was: 
Deletion of existing file when write_table fails)

> [Python] Deletion of existing file when write_table fails
> ---------------------------------------------------------
>
>                 Key: ARROW-10611
>                 URL: https://issues.apache.org/jira/browse/ARROW-10611
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 1.0.1, 2.0.0
>         Environment: Docker container with Jupyter notebook; RHEL 7.6; POSIX 
> compliant filesystem
>            Reporter: Stefan Lange
>            Priority: Major
>
> Line 1737 of [python/pyarrow/parquet.py at commit 47f2e0c|https://github.com/apache/arrow/blob/47f2e0cb03ed8ad265e0688ada8162bf46066483/python/pyarrow/parquet.py#L1737]:
> {{os.remove(_stringify_path(where))}}
>  
> When write_table encounters a problem, the exception handler removes the 
> attempted output parquet file (see snippet below).
> This logic makes sense in order to ensure that no file with inconsistent 
> content/state remains.
> However, if a file with the same name already exists, it also gets deleted.
> Would it make sense to add an option that lets the user choose the behaviour 
> in such a case, e.g. to keep an existing file and only overwrite it if the 
> write succeeds (a possible shape for this is sketched after the snippet 
> below)?
> And/or: would it make sense to check early whether the intended file can be 
> written, and to fail early if it cannot (without deleting a preexisting 
> file)?
> E.g. if the directory has permission 755 and the already existing file has 
> permission 444, the write attempt fails with a PermissionError, but the 
> exception handler still deletes the preexisting file. This behaviour seems a 
> bit counterintuitive.
> Or would you say the responsibility lies with the people setting the 
> file/directory permissions?
>  
> {code:python}
> except Exception:
>     if _is_path_like(where):
>         try:
>             os.remove(_stringify_path(where))
>         except os.error:
>             pass
> {code}
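>  
> For illustration, the "keep the existing file" option could be implemented 
> with a write-to-a-temporary-file-then-rename pattern. This is only a sketch, 
> not pyarrow's API; the helper name write_table_safely is hypothetical:
> {code:python}
> import os
> import tempfile
> 
> import pyarrow.parquet as pq
> 
> 
> def write_table_safely(table, where, **kwargs):
>     # Hypothetical helper (not part of pyarrow): write to a temporary
>     # file in the target directory first, then atomically replace the
>     # destination, so a failed write never deletes a preexisting file.
>     directory = os.path.dirname(os.path.abspath(where))
>     fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
>     os.close(fd)
>     try:
>         pq.write_table(table, tmp_path, **kwargs)
>         os.replace(tmp_path, where)  # atomic rename on POSIX
>     except Exception:
>         try:
>             os.remove(tmp_path)  # clean up only the temporary file
>         except os.error:
>             pass
>         raise
> {code}
> (On POSIX, the final rename only needs write permission on the directory, so 
> a successful write would still replace a read-only file, while a failed 
> write would leave it untouched.)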
> As [https://github.com/chr1st1ank] added in 
> [https://github.com/apache/arrow/issues/8607]: 
> This can be reproduced with the following commands in ipython. Pandas uses 
> pyarrow to read/write Parquet.
> In effect, attempting to write to a file without write permission on it 
> results in the deletion of that file (of course only if the user has 
> sufficient permissions on the directory to delete the file).
> {code:python}
> >>> import pandas as pd
> >>>
> >>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
> >>> df.to_parquet('df.parquet.gzip', compression='gzip')
> >>> pd.read_parquet('df.parquet.gzip')
> >>> !ls -l 'df.parquet.gzip'
> -rw-r--r-- 1 myuser domain users 1529 Nov 13 09:31 df.parquet.gzip
> >>> !chmod 000 'df.parquet.gzip'
> >>> df.to_parquet('df.parquet.gzip', compression='gzip')
> ---------------------------------------------------------------------------
> PermissionError                           Traceback (most recent call last)
> <ipython-input-10-584c5c8752e0> in <module>
> ----> 1 df.to_parquet('df.parquet.gzip', compression='gzip')
> 
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
>    2220             index=index,
>    2221             partition_cols=partition_cols,
> -> 2222             **kwargs
>    2223         )
>    2224
> 
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
>     252         index=index,
>     253         partition_cols=partition_cols,
> --> 254         **kwargs
>     255     )
>     256
> 
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
>     115             compression=compression,
>     116             coerce_timestamps=coerce_timestamps,
> --> 117             **kwargs
>     118         )
>     119
> 
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, flavor, **kwargs)
>    1122                 compression=compression,
>    1123                 use_deprecated_int96_timestamps=use_int96,
> -> 1124                 **kwargs) as writer:
>    1125             writer.write_table(table, row_group_size=row_group_size)
>    1126         except Exception:
> 
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, where, schema, flavor, version, use_dictionary, compression, use_deprecated_int96_timestamps, **options)
>     338         if _is_path_like(where):
>     339             fs = _get_fs_from_path(where)
> --> 340             sink = self.file_handle = fs.open(where, 'wb')
>     341         else:
>     342             sink = where
> 
> ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/filesystem.py in open(self, path, mode)
>     243         """
>     244         path = _stringify_path(path)
> --> 245         return open(path, mode=mode)
>     246
>     247     @property
> 
> PermissionError: [Errno 13] Permission denied: 'df.parquet.gzip'
> 
> >>> !ls -l 'df.parquet.gzip'
> ls: cannot access df.parquet.gzip: No such file or directory
> {code}
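>  
> A minimal sketch of the early-check idea, assuming a best-effort os.access 
> guard is acceptable (it has TOCTOU caveats and may misreport effective 
> permissions on some network filesystems); the helper _can_write is 
> hypothetical:
> {code:python}
> import os
> 
> 
> def _can_write(path):
>     # Hypothetical pre-check (not part of pyarrow): fail fast before
>     # opening the sink, without touching any preexisting file.
>     if os.path.exists(path):
>         return os.access(path, os.W_OK)
>     parent = os.path.dirname(os.path.abspath(path))
>     return os.access(parent, os.W_OK | os.X_OK)
> {code}
> write_table could run such a check before opening the sink and raise a 
> PermissionError early, instead of cleaning up (and deleting) afterwards.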
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
