westonpace commented on a change in pull request #11552:
URL: https://github.com/apache/arrow/pull/11552#discussion_r737037093
##########
File path: r/R/dataset-write.R
##########
@@ -97,6 +97,7 @@ write_dataset <- function(dataset,
partitioning = dplyr::group_vars(dataset),
basename_template = paste0("part-{i}.",
as.character(format)),
hive_style = TRUE,
+ existing_data_behavior = c("overwrite", "error",
"delete_matching"),
Review comment:
I'll add docs. An append behavior would be nice, but I think it's been
rejected in the past. There are several approaches that could be taken:
1. Scan the directory before we start writing to find the largest counter
value currently in use and start counting from there.
2. When we're about to write a file, look to see if the filename already
exists and increment some counter (e.g. when downloading from Firefox/Chrome
you get `foo.txt` and then `foo(1).txt`).
3. Allow a UUID to be used instead of a counter in the basename template.
For example, you could use a basename template of `{uuid}-{i}`.
The JIRA for this is https://issues.apache.org/jira/browse/ARROW-10695 and
the outcome there was that users are capable of fixing this themselves. For
example, a user can generate a UUID every time they call write_dataset and
include it as part of the basename template (e.g. see
https://stackoverflow.com/questions/69184289/pyarrow-overwrites-dataset-when-using-s3-filesystem/69185178#69185178).
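To make that workaround concrete, here is a minimal R sketch of the UUID
approach; the output directory is hypothetical and `uuid::UUIDgenerate()` is
just one way to produce the identifier:

```r
library(arrow)
library(uuid)

# Generate a fresh UUID per write_dataset call so that files from
# different writes can never collide with one another.
write_dataset(
  mtcars,
  "some/dataset/root",  # hypothetical output directory
  format = "parquet",
  basename_template = paste0("part-", uuid::UUIDgenerate(), "-{i}.parquet")
)
```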
"Delete matching" is pretty niche. The origin of the feature is
https://issues.apache.org/jira/browse/ARROW-12358 and the use case was
something like:
* Every Friday user downloads data for the week and gets partial data for
the current day (friday)
* The next week the user does the same thing and this time they have the
full data for last friday and they want to overwrite that partition of data but
keep all of the other days.
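A hedged sketch of that weekly-refresh pattern, assuming the download has a
`day` column to partition on (`weekly_data` and the path are hypothetical).
Since the grouping variables become the partitioning (per the default in the
diff above), the new files only land in the partitions present in this week's
data, so "delete_matching" replaces last Friday's partial partition and leaves
the other days alone:

```r
library(arrow)
library(dplyr)

# weekly_data: hypothetical data frame holding this week's download,
# with a `day` column identifying the partition each row belongs to.
weekly_data %>%
  group_by(day) %>%  # group vars become the dataset partitioning
  write_dataset(
    "some/dataset/root",  # hypothetical output directory
    format = "parquet",
    existing_data_behavior = "delete_matching"
  )
```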