djnavarro commented on a change in pull request #12154:
URL: https://github.com/apache/arrow/pull/12154#discussion_r785547887



##########
File path: r/R/util.R
##########
@@ -209,3 +209,74 @@ handle_csv_read_error <- function(e, schema) {
 
   abort(e)
 }
+
+
+parse_period_unit <- function(x) {
+
+  # the regexp matches against fractional units, but per lubridate
+  # supports integer multiples of a known unit only
+  match_info <- regexpr(
+    pattern = " *(?<multiple>[0-9.,]+)? *(?<unit>[^ \t\n]+)",
+    text = x[[1]],
+    perl = TRUE
+  )
+
+  capture_start <- attr(match_info, "capture.start")
+  capture_length <- attr(match_info, "capture.length")
+  capture_end <- capture_start + capture_length - 1L
+
+  str_unit <- substr(x, capture_start[[2]], capture_end[[2]])
+  str_multiple <- substr(x, capture_start[[1]], capture_end[[1]])
+
+  known_units <- c("nanosecond", "microsecond", "millisecond", "second",
+                   "minute", "hour", "day", "week", "month", "quarter", "year")
+
+  # match the period unit
+  str_unit_start <- substr(str_unit, 1, 3)
+  unit <- as.integer(pmatch(str_unit_start, known_units)) - 1L
+
+  if(any(is.na(unit))) {
+    abort(sprintf("Unknown unit '%s'", str_unit))
+  }
+
+  # empty string in multiple interpreted as 1
+  if(capture_length[[1]] == 0) {
+    multiple <- 1L
+
+  } else {
+
+    # special cases: interpret fractions of 1 second as integer
+    # multiples of nanoseconds, microseconds, or milliseconds
+    # to mirror lubridate syntax
+    multiple <- as.numeric(str_multiple)
+
+    if(unit == 3L & multiple < 10^-6) {
+      unit <- 0L
+      multiple <- 10^9 * multiple
+    }
+    if(unit == 3L & multiple < 10^-3) {
+      unit <- 1L
+      multiple <- 10^6 * multiple
+    }
+    if(unit == 3L & multiple < 1) {
+      unit <- 2L
+      multiple <- 10^3 * multiple
+    }
+
+    multiple <- as.integer(multiple)
+  }
+
+  # more special cases: lubridate imposes sensible maximum
+  # values on the number of seconds, minutes and hours
+  if(unit == 3L & multiple > 60) {
+    abort("Rounding with second > 60 is not supported")
+  }

Review comment:
       @thisisnic: So yeah... I haven't finished my dive into the lubridate 
code on the `change_on_boundary` issue, but (I think!) the TL;DR is that it's 
an edge case for how `ceiling_date()` handles time points that sit at the start 
of a day. Should the ceiling of `2022-01-17 00:00:00` be `2022-01-18 00:00` or 
`2022-01-17 00:00`? Conceptually, "January 17" is an interval of time that 
starts at 00:00 and ends at 23:59, so a strict interpretation suggests that all 
time points within that interval should ceiling to midnight of January 18. 
That's what lubridate currently does, but...
   
   Prior to lubridate 1.6.0 `ceiling_date()` took a different approach, and 
AFAICT the C++ `round_temporal()` function does the same thing. If we were to 
ignore the strict interpretation of ISO 8601, we can think of "midnight 
moments" as analogous to integer values spaced along the real line, and round 
accordingly. Under this approach, which I think is more intuitive but does 
violate ISO 8601, we don't ceiling an integer up to the next integer, and so 
`2022-01-17 00:00` stays `2022-01-17 00:00` because it's already an "integer 
valued date".
   
   The downside to violating ISO 8601 -- in addition to the obvious problem 
that it violates ISO 8601 -- is that it means we're no longer in agreement with 
lubridate. The downside to adhering to it is that you get counterintuitive 
result that taking a ceiling of a Date object always returns the next day:
   
   ```r
   ceiling_date(ymd("2022-01-17"), "day") # returns 2022-01-18
   ```
   
   Long story short, the `change_on_boundary` argument exists so that lubridate 
can revert to its older behaviour. Given that the C++ `round_temporal()` only 
matches the older lubridate behaviour (again, assuming I've understood the code 
properly), I'll need to write a shim that supports the `change_on_boundary` 
anyway. If I don't we're not mirroring lubridate.
   
   I think the right approach is probably the same one @rok suggested for 
`week_start`. Short term the solution is to write a shim to support it on the R 
side, and open a JIRA ticket for the C++ library




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to