djnavarro commented on a change in pull request #12154: URL: https://github.com/apache/arrow/pull/12154#discussion_r785547887
########## File path: r/R/util.R ########## @@ -209,3 +209,74 @@ handle_csv_read_error <- function(e, schema) { abort(e) } + + +parse_period_unit <- function(x) { + + # the regexp matches against fractional units, but per lubridate + # supports integer multiples of a known unit only + match_info <- regexpr( + pattern = " *(?<multiple>[0-9.,]+)? *(?<unit>[^ \t\n]+)", + text = x[[1]], + perl = TRUE + ) + + capture_start <- attr(match_info, "capture.start") + capture_length <- attr(match_info, "capture.length") + capture_end <- capture_start + capture_length - 1L + + str_unit <- substr(x, capture_start[[2]], capture_end[[2]]) + str_multiple <- substr(x, capture_start[[1]], capture_end[[1]]) + + known_units <- c("nanosecond", "microsecond", "millisecond", "second", + "minute", "hour", "day", "week", "month", "quarter", "year") + + # match the period unit + str_unit_start <- substr(str_unit, 1, 3) + unit <- as.integer(pmatch(str_unit_start, known_units)) - 1L + + if(any(is.na(unit))) { + abort(sprintf("Unknown unit '%s'", str_unit)) + } + + # empty string in multiple interpreted as 1 + if(capture_length[[1]] == 0) { + multiple <- 1L + + } else { + + # special cases: interpret fractions of 1 second as integer + # multiples of nanoseconds, microseconds, or milliseconds + # to mirror lubridate syntax + multiple <- as.numeric(str_multiple) + + if(unit == 3L & multiple < 10^-6) { + unit <- 0L + multiple <- 10^9 * multiple + } + if(unit == 3L & multiple < 10^-3) { + unit <- 1L + multiple <- 10^6 * multiple + } + if(unit == 3L & multiple < 1) { + unit <- 2L + multiple <- 10^3 * multiple + } + + multiple <- as.integer(multiple) + } + + # more special cases: lubridate imposes sensible maximum + # values on the number of seconds, minutes and hours + if(unit == 3L & multiple > 60) { + abort("Rounding with second > 60 is not supported") + } Review comment: @thisisnic: So yeah... I haven't finished my dive into the lubridate code on the `change_on_boundary` issue, but (I think!) the TL;DR is that it's an edge case for how `ceiling_date()` handles time points that sit at the start of a day. Should the ceiling of `2022-01-17 00:00:00` be `2022-01-18 00:00` or `2022-01-17 00:00`? Conceptually, "January 17" is an interval of time that starts at 00:00 and ends at 23:59, so a strict interpretation suggests that all time points within that interval should ceiling to midnight of January 18. That's what lubridate currently does, but... Prior to lubridate 1.6.0 `ceiling_date()` took a different approach, and AFAICT the C++ `round_temporal()` function does the same thing. If we were to ignore the strict interpretation of ISO 8601, we can think of "midnight moments" as analogous to integer values spaced along the real line, and round accordingly. Under this approach, which I think is more intuitive but does violate ISO 8601, we don't ceiling an integer up to the next integer, and so `2022-01-17 00:00` stays `2022-01-17 00:00` because it's already an "integer valued date". The downside to violating ISO 8601 -- in addition to the obvious problem that it violates ISO 8601 -- is that it means we're no longer in agreement with lubridate. The downside to adhering to it is that you get counterintuitive result that taking a ceiling of a Date object always returns the next day: ```r ceiling_date(ymd("2022-01-17"), "day") # returns 2022-01-18 ``` Long story short, the `change_on_boundary` argument exists so that lubridate can revert to its older behaviour. Given that the C++ `round_temporal()` only matches the older lubridate behaviour (again, assuming I've understood the code properly), I'll need to write a shim that supports the `change_on_boundary` anyway. If I don't we're not mirroring lubridate. I think the right approach is probably the same one @rok suggested for `week_start`. Short term the solution is to write a shim to support it on the R side, and open a JIRA ticket for the C++ library -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org