westonpace commented on a change in pull request #11552:
URL: https://github.com/apache/arrow/pull/11552#discussion_r737037093
##########
File path: r/R/dataset-write.R
##########
@@ -97,6 +97,7 @@ write_dataset <- function(dataset,
partitioning = dplyr::group_vars(dataset),
basename_template = paste0("part-{i}.",
as.character(format)),
hive_style = TRUE,
+ existing_data_behavior = c("overwrite", "error",
"delete_matching"),
Review comment:
I'll add docs. An append behavior would be nice, but I think it's been
rejected in the past. There are several approaches that could be taken:
1. Scan the directory before we start writing to find the largest counter
value currently in use and start counting from there.
2. When we're about to write a file, look to see if the filename already
exists and increment some counter (e.g. when downloading from Firefox/Chrome
you get `foo.txt` and then `foo(1).txt`).
3. Allow a UUID to be used instead of a counter in the basename template.
For example, you could use a basename template of `{uuid}-{i}`.
The JIRA for this is https://issues.apache.org/jira/browse/ARROW-10695 and
the outcome there was that users are capable of fixing this themselves. For
example, a user can generate a UUID every time they call write_dataset and
include it as part of the basename template (e.g. see
https://stackoverflow.com/questions/69184289/pyarrow-overwrites-dataset-when-using-s3-filesystem/69185178#69185178).
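To make that workaround concrete, here is a minimal R sketch of the UUID
approach; the output directory is hypothetical and `uuid::UUIDgenerate()` is
just one way to produce the identifier:

```r
library(arrow)
library(uuid)

# Generate a fresh UUID per write_dataset call so that files from
# different writes can never collide with one another.
write_dataset(
  mtcars,
  "some/dataset/root",  # hypothetical output directory
  format = "parquet",
  basename_template = paste0("part-", uuid::UUIDgenerate(), "-{i}.parquet")
)
```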
"Delete matching" is pretty niche. The origin of the feature is
https://issues.apache.org/jira/browse/ARROW-12358 and the use case was
something like:
* Every Friday user downloads data for the week and gets partial data for
the current day (friday)
* The next week the user does the same thing and this time they have the
full data for last friday and they want to overwrite that partition of data but
keep all of the other days.
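A hedged sketch of that weekly-refresh pattern, assuming the download has a
`day` column to partition on (`weekly_data` and the path are hypothetical).
Since the grouping variables become the partitioning (per the default in the
diff above), the new files only land in the partitions present in this week's
data, so "delete_matching" replaces last Friday's partial partition and leaves
the other days alone:

```r
library(arrow)
library(dplyr)

# weekly_data: hypothetical data frame holding this week's download,
# with a `day` column identifying the partition each row belongs to.
weekly_data %>%
  group_by(day) %>%  # group vars become the dataset partitioning
  write_dataset(
    "some/dataset/root",  # hypothetical output directory
    format = "parquet",
    existing_data_behavior = "delete_matching"
  )
```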