schelhorn opened a new issue #11781:
URL: https://github.com/apache/arrow/issues/11781
Dear Support,
thank you for arrow - it's great! I was wondering how `arrow`'s R interface
behaves, in terms of its on-disk transaction model, when Parquet
partitions/part files are only partially written to disk.
If I have a (new or existing) Parquet dataset managed by `arrow`, and I am
adding a partition to that dataset using:
```r
arrow::write_dataset(dataset = my_df,
                     path = "my_dataset.parquet",
                     format = "parquet",
                     partitioning = "my_grouping_var",
                     compression = "snappy",
                     use_dictionary = TRUE,
                     write_statistics = TRUE)
```
and the write is _interrupted_ for some reason (for instance, because the
device runs out of disk space, the network connection to shared storage
stalls, or the process runs out of memory), will the partial write remain in
the dataset as a broken part file? If so, is there any way to recover from
that situation?
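
To make the question concrete, the kind of clean-up I had in mind is sketched
below: it simply tries to read every part file and flags the ones that fail.
I do not know whether this is a supported or recommended recovery path; the
directory name, file pattern, and clean-up step are assumptions taken from my
example above.

```r
library(arrow)

# List every Parquet part file under the dataset directory
# ("my_dataset.parquet" and the file pattern are assumptions from the example above).
part_files <- list.files("my_dataset.parquet", pattern = "\\.parquet$",
                         recursive = TRUE, full.names = TRUE)

# Treat a part file that cannot be read at all as a broken partial write.
is_readable <- vapply(part_files, function(f) {
  tryCatch({ read_parquet(f); TRUE }, error = function(e) FALSE)
}, logical(1))

broken <- part_files[!is_readable]
broken  # inspect before deleting anything, e.g. with file.remove(broken)
```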
Secondly, is it safe to have two or more R processes add partitions/parts to
the Parquet dataset in parallel, and, if so, how is it ensured that no race
conditions occur, i.e., that the generated part files do not overwrite each
other?
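
The workaround I was considering is to give each writing process its own
`basename_template` so that part file names cannot collide; a minimal sketch
follows, where the pid-based naming is my own convention and I do not know
whether it is enough to make concurrent writes safe:

```r
library(arrow)

# Each R process writes part files with a process-specific prefix so that
# concurrent writers cannot produce the same file name ("{i}" is the
# per-writer file counter required by basename_template).
arrow::write_dataset(dataset = my_df,
                     path = "my_dataset.parquet",
                     format = "parquet",
                     partitioning = "my_grouping_var",
                     basename_template = paste0("part-", Sys.getpid(), "-{i}.parquet"))
```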
Thanks a bunch!