Stefan Lange created ARROW-10611:
------------------------------------

             Summary: Deletion of existing file when write_table fails
                 Key: ARROW-10611
                 URL: https://issues.apache.org/jira/browse/ARROW-10611
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 2.0.0, 1.0.1
         Environment: Docker container with Jupyter notebook; RHEL 7.6; POSIX 
compliant filesystem
            Reporter: Stefan Lange


Line 1737 of {{python/pyarrow/parquet.py}} at commit [47f2e0c|https://github.com/apache/arrow/commit/47f2e0cb03ed8ad265e0688ada8162bf46066483]:
[https://github.com/apache/arrow/blob/47f2e0cb03ed8ad265e0688ada8162bf46066483/python/pyarrow/parquet.py#L1737]

{{os.remove(_stringify_path(where))}}

 

When write_table encounters a problem, the exception handler removes the 
attempted output Parquet file (see snippet below).
This logic makes sense in order to ensure that no file with inconsistent 
content or state remains.
However, if a file with the same name already exists, it also gets deleted.

Would it make sense to add an option letting the user choose the behaviour in 
such a case, e.g. to keep an existing file and overwrite it only if the write 
succeeds?
And/or: would it make sense to check early whether the intended file can be 
written, and fail early if it cannot (without deleting a preexisting file)?
For example, if the directory has permission 755 and the already existing file 
has permission 444, the write attempt fails with a PermissionError, but the 
exception handler still deletes the preexisting file. This behaviour seems 
counterintuitive.
Or would you say the responsibility lies with whoever sets the file and 
directory permissions?
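A safer variant along the lines suggested above could write to a temporary file and replace the target only on success. The sketch below assumes a local POSIX filesystem; {{write_table_atomic}} is a hypothetical helper, not an existing pyarrow API:

{code:python}
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq


def write_table_atomic(table: pa.Table, where: str, **kwargs) -> None:
    """Hypothetical sketch: write to a temporary file in the target
    directory, then atomically rename over `where` on success, so a
    preexisting file is never lost on failure."""
    directory = os.path.dirname(os.path.abspath(where))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".parquet.tmp")
    os.close(fd)
    try:
        pq.write_table(table, tmp_path, **kwargs)
        # os.replace is atomic on POSIX: the old file is only
        # overwritten once the new one is complete.
        os.replace(tmp_path, where)
    except Exception:
        # On failure, remove only the temporary file; any
        # preexisting file at `where` is left untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
{code}

This only works where rename is atomic (local filesystems, same directory); it would not carry over directly to object stores.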

 

{code:python}
except Exception:
    if _is_path_like(where):
        try:
            os.remove(_stringify_path(where))
        except os.error:
            pass
{code}
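One possible adjustment of that handler, sketched as a standalone helper (hypothetical, not the actual pyarrow implementation): remember whether the target already existed and remove it only if this call created it. Note this still risks leaving a half-overwritten file behind when the target did exist.

{code:python}
import os


def write_with_guarded_cleanup(write_fn, where):
    """Hypothetical sketch: call write_fn(where); on failure, delete
    the output file only if it did not exist before this call."""
    existed_before = os.path.exists(where)
    try:
        write_fn(where)
    except Exception:
        if not existed_before:
            try:
                os.remove(where)
            except OSError:
                pass
        raise
{code}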

As [chr1st1ank|https://github.com/chr1st1ank] added in 
[https://github.com/apache/arrow/issues/8607]:

This can be reproduced with the following commands in IPython (Pandas uses 
pyarrow to read/write Parquet). In effect, attempting to write to a file 
without write permission on it results in the deletion of that file (provided, 
of course, that the user has sufficient permissions on the directory to delete 
it).

{code:python}
>> import pandas as pd
>>
>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>> df.to_parquet('df.parquet.gzip', compression='gzip')
>> pd.read_parquet('df.parquet.gzip')
>> !ls -l 'df.parquet.gzip'

-rw-r--r-- 1 myuser domain users 1529 Nov 13 09:31 df.parquet.gzip

>> !chmod 000 'df.parquet.gzip'
>> df.to_parquet('df.parquet.gzip', compression='gzip')
{code}

{code}
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
<ipython-input-10-584c5c8752e0> in <module>
----> 1 df.to_parquet('df.parquet.gzip', compression='gzip')

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2220     index=index,
   2221     partition_cols=partition_cols,
-> 2222     **kwargs
   2223 )
   2224

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    252     index=index,
    253     partition_cols=partition_cols,
--> 254     **kwargs
    255 )
    256

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
    115     compression=compression,
    116     coerce_timestamps=coerce_timestamps,
--> 117     **kwargs
    118 )
    119

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, flavor, **kwargs)
   1122     compression=compression,
   1123     use_deprecated_int96_timestamps=use_int96,
-> 1124     **kwargs) as writer:
   1125     writer.write_table(table, row_group_size=row_group_size)
   1126 except Exception:

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, where, schema, flavor, version, use_dictionary, compression, use_deprecated_int96_timestamps, **options)
    338 if _is_path_like(where):
    339     fs = _get_fs_from_path(where)
--> 340     sink = self.file_handle = fs.open(where, 'wb')
    341 else:
    342     sink = where

~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/filesystem.py in open(self, path, mode)
    243     """
    244     path = _stringify_path(path)
--> 245     return open(path, mode=mode)
    246
    247 @property

PermissionError: [Errno 13] Permission denied: 'df.parquet.gzip'
{code}

 

{code}
>> !ls -l 'df.parquet.gzip'
ls: cannot access df.parquet.gzip: No such file or directory
{code}
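As for failing early: a pre-flight check along these lines could raise before any existing file is touched. {{check_writable}} is a hypothetical helper, and {{os.access}} checks are advisory and racy (they can mislead under ACLs or when running as root), so this would complement, not replace, a safer cleanup:

{code:python}
import os


def check_writable(where: str) -> None:
    """Hypothetical early check: raise PermissionError up front if
    `where` cannot plausibly be (over)written."""
    if os.path.exists(where):
        if not os.access(where, os.W_OK):
            raise PermissionError(
                "cannot overwrite read-only file: %s" % where)
    else:
        parent = os.path.dirname(os.path.abspath(where)) or "."
        if not os.access(parent, os.W_OK):
            raise PermissionError(
                "cannot create file in: %s" % parent)
{code}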

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)