thisisnic commented on a change in pull request #12154:
URL: https://github.com/apache/arrow/pull/12154#discussion_r785521592



##########
File path: r/R/util.R
##########
@@ -209,3 +209,74 @@ handle_csv_read_error <- function(e, schema) {
 
   abort(e)
 }
+
+
+parse_period_unit <- function(x) {
+
+  # the regexp matches against fractional units, but per lubridate
+  # supports integer multiples of a known unit only
+  match_info <- regexpr(
+    pattern = " *(?<multiple>[0-9.,]+)? *(?<unit>[^ \t\n]+)",
+    text = x[[1]],
+    perl = TRUE
+  )
+
+  capture_start <- attr(match_info, "capture.start")
+  capture_length <- attr(match_info, "capture.length")
+  capture_end <- capture_start + capture_length - 1L
+
+  str_unit <- substr(x, capture_start[[2]], capture_end[[2]])
+  str_multiple <- substr(x, capture_start[[1]], capture_end[[1]])
+
+  known_units <- c("nanosecond", "microsecond", "millisecond", "second",
+                   "minute", "hour", "day", "week", "month", "quarter", "year")
+
+  # match the period unit
+  str_unit_start <- substr(str_unit, 1, 3)
+  unit <- as.integer(pmatch(str_unit_start, known_units)) - 1L
+
+  if(any(is.na(unit))) {
+    abort(sprintf("Unknown unit '%s'", str_unit))
+  }
+
+  # empty string in multiple interpreted as 1
+  if(capture_length[[1]] == 0) {
+    multiple <- 1L
+
+  } else {
+
+    # special cases: interpret fractions of 1 second as integer
+    # multiples of nanoseconds, microseconds, or milliseconds
+    # to mirror lubridate syntax
+    multiple <- as.numeric(str_multiple)
+
+    if(unit == 3L & multiple < 10^-6) {
+      unit <- 0L
+      multiple <- 10^9 * multiple
+    }
+    if(unit == 3L & multiple < 10^-3) {
+      unit <- 1L
+      multiple <- 10^6 * multiple
+    }
+    if(unit == 3L & multiple < 1) {
+      unit <- 2L
+      multiple <- 10^3 * multiple
+    }
+
+    multiple <- as.integer(multiple)
+  }
+
+  # more special cases: lubridate imposes sensible maximum
+  # values on the number of seconds, minutes and hours
+  if(unit == 3L & multiple > 60) {
+    abort("Rounding with second > 60 is not supported")
+  }

Review comment:
       I can see where you are coming from here.  I guess, generally, the aim 
when implementing the functions for non-standard evaluation code blocks is to 
(as closely as possible unless there's good reason not to!) replicate the 
original.  However, there are some rare cases when I've seen us not do that; 
for example, our mapping to `lubridate::second()` supports nanoseconds as we 
didn't see the harm in having more precise results, and so the unit test for 
that binding looks like this: 
https://github.com/apache/arrow/blob/0c1fd88953585485b772dfd405bbc5b1b5417324/r/tests/testthat/test-dplyr-funcs-datetime.R#L474-L483
 
   
   That said, I also can't think of a reason someone else's code would depend 
on raising an error *but* I'm not confident that that would mean it won't - 
people have all sorts of unusual workflows.  
   
   There are multiple ways of calling Arrow C++ compute functions from R via 
the `dplyr`/non-standard evaluation code - either using the bindings to the 
`lubridate` etc equivalent, or calling the Arrow C++ compute function directly 
by prefixing the function name with `arrow_` (so in this case it'd be 
`arrow_round_temporal`; if you've not come across this before, check it out 
here: 
https://arrow.apache.org/cookbook/r/manipulating-data---tables.html#use-arrow-functions-in-dplyr-verbs-in-arrow).
   
   So if people use that directly they'd still have access to the full 
functionality. I'm not sure what to suggest here - would be good to get input 
from @jonkeane or @nealrichardson 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to