thisisnic commented on a change in pull request #12154:
URL: https://github.com/apache/arrow/pull/12154#discussion_r785521592
##########
File path: r/R/util.R
##########
@@ -209,3 +209,74 @@ handle_csv_read_error <- function(e, schema) {
abort(e)
}
+
+
+parse_period_unit <- function(x) {
+
+ # the regexp matches against fractional units, but per lubridate
+ # supports integer multiples of a known unit only
+ match_info <- regexpr(
+ pattern = " *(?<multiple>[0-9.,]+)? *(?<unit>[^ \t\n]+)",
+ text = x[[1]],
+ perl = TRUE
+ )
+
+ capture_start <- attr(match_info, "capture.start")
+ capture_length <- attr(match_info, "capture.length")
+ capture_end <- capture_start + capture_length - 1L
+
+ str_unit <- substr(x, capture_start[[2]], capture_end[[2]])
+ str_multiple <- substr(x, capture_start[[1]], capture_end[[1]])
+
+ known_units <- c("nanosecond", "microsecond", "millisecond", "second",
+ "minute", "hour", "day", "week", "month", "quarter", "year")
+
+ # match the period unit
+ str_unit_start <- substr(str_unit, 1, 3)
+ unit <- as.integer(pmatch(str_unit_start, known_units)) - 1L
+
+ if(any(is.na(unit))) {
+ abort(sprintf("Unknown unit '%s'", str_unit))
+ }
+
+ # empty string in multiple interpreted as 1
+ if(capture_length[[1]] == 0) {
+ multiple <- 1L
+
+ } else {
+
+ # special cases: interpret fractions of 1 second as integer
+ # multiples of nanoseconds, microseconds, or milliseconds
+ # to mirror lubridate syntax
+ multiple <- as.numeric(str_multiple)
+
+ if(unit == 3L & multiple < 10^-6) {
+ unit <- 0L
+ multiple <- 10^9 * multiple
+ }
+ if(unit == 3L & multiple < 10^-3) {
+ unit <- 1L
+ multiple <- 10^6 * multiple
+ }
+ if(unit == 3L & multiple < 1) {
+ unit <- 2L
+ multiple <- 10^3 * multiple
+ }
+
+ multiple <- as.integer(multiple)
+ }
+
+ # more special cases: lubridate imposes sensible maximum
+ # values on the number of seconds, minutes and hours
+ if(unit == 3L & multiple > 60) {
+ abort("Rounding with second > 60 is not supported")
+ }
Review comment:
I can see where you are coming from here. I guess, generally, the aim
when implementing the functions for non-standard evaluation code blocks is to
(as closely as possible unless there's good reason not to!) replicate the
original. However, there are some rare cases when I've seen us not do that;
for example, our mapping to `lubridate::second()` supports nanoseconds as we
didn't see the harm in having more precise results, and so the unit test for
that binding looks like this:
https://github.com/apache/arrow/blob/0c1fd88953585485b772dfd405bbc5b1b5417324/r/tests/testthat/test-dplyr-funcs-datetime.R#L474-L483
That said, I also can't think of a reason someone else's code would depend
on raising an error *but* I'm not confident that that would mean it won't -
people have all sorts of unusual workflows.
There are multiple ways of calling Arrow C++ compute functions from R via
the `dplyr`/non-standard evaluation code - either using the bindings to the
`lubridate` etc equivalent, or calling the Arrow C++ compute function directly
by prefixing the function name with `arrow_` (so in this case it'd be
`arrow_round_temporal`; if you've not come across this before, check it out
here:
https://arrow.apache.org/cookbook/r/manipulating-data---tables.html#use-arrow-functions-in-dplyr-verbs-in-arrow).
So if people use that directly they'd still have access to the full
functionality. I'm not sure what to suggest here - would be good to get input
from @jonkeane or @nealrichardson
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]