schelhorn opened a new issue #11781:
URL: https://github.com/apache/arrow/issues/11781
Dear Support,
thank you for arrow - it's great! I was wondering how `arrow`'s R interface
behaves, in terms of its on-disk transaction model, when Parquet
partitions/part files are only partially written to disk.
If I have a (new or existing) Parquet dataset managed by `arrow`, and I am
adding a partition to that dataset using:
```r
arrow::write_dataset(dataset = my_df,
                     path = "my_dataset.parquet",
                     format = "parquet",
                     partitioning = "my_grouping_var",
                     compression = "snappy",
                     use_dictionary = TRUE,
                     write_statistics = TRUE)
```
and the write is _interrupted_ for some reason (for instance, because the
device runs out of disk space, the network connection to shared storage
stalls, or the process runs out of memory), will the partial write remain in
the dataset as a broken part file? If so, is there any way to recover from
that situation?
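
To make the question concrete, the kind of clean-up I had in mind is sketched
below: it simply tries to read every part file and flags the ones that fail.
I do not know whether this is a supported or recommended recovery path; the
directory name, file pattern, and clean-up step are assumptions taken from my
example above.

```r
library(arrow)

# List every Parquet part file under the dataset directory
# ("my_dataset.parquet" and the file pattern are assumptions from the example above).
part_files <- list.files("my_dataset.parquet", pattern = "\\.parquet$",
                         recursive = TRUE, full.names = TRUE)

# Treat a part file that cannot be read at all as a broken partial write.
is_readable <- vapply(part_files, function(f) {
  tryCatch({ read_parquet(f); TRUE }, error = function(e) FALSE)
}, logical(1))

broken <- part_files[!is_readable]
broken  # inspect before deleting anything, e.g. with file.remove(broken)
```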
Secondly, is it safe to have two or more R processes add partitions/parts to
the Parquet dataset in parallel, and, if so, how is it ensured that no race
conditions occur, i.e., that the generated part files do not overwrite each
other?
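
The workaround I was considering is to give each writing process its own
`basename_template` so that part file names cannot collide; a minimal sketch
follows, where the pid-based naming is my own convention and I do not know
whether it is enough to make concurrent writes safe:

```r
library(arrow)

# Each R process writes part files with a process-specific prefix so that
# concurrent writers cannot produce the same file name ("{i}" is the
# per-writer file counter required by basename_template).
arrow::write_dataset(dataset = my_df,
                     path = "my_dataset.parquet",
                     format = "parquet",
                     partitioning = "my_grouping_var",
                     basename_template = paste0("part-", Sys.getpid(), "-{i}.parquet"))
```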
Thanks a bunch!