jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r814346290
##########
File path: r/R/dplyr-funcs-type.R
##########
@@ -76,6 +76,13 @@ register_bindings_type_cast <- function() {
register_binding("as.numeric", function(x) {
Expression$create("cast", x, options = cast_options(to_type = float64()))
})
+ register_binding("as.Date", function(x) {
+ # base::as.Date() first converts to UTC and then extracts the date, which
is
+ # why we need to go through timestamp() first - see unit tests for the real
+ # life impact of the difference between lubridate::date() and
base::as.Date()
+ y <- build_expr("cast", x, options = cast_options(to_type = timestamp()))
Review comment:
I assume that this line is depending on the default timezone for
`timestamp()` to be UTC, yeah? Maybe we should be explicit about that here
since to make it a little bit easier to read | reason about?
##########
File path: r/tests/testthat/test-dplyr-funcs-datetime.R
##########
@@ -801,3 +801,85 @@ test_that("dst extracts daylight savings time correctly", {
test_df
)
})
+
+test_that("date works in arrow", {
+ # https://issues.apache.org/jira/browse/ARROW-13168
+ skip_on_os("windows")
+ # this date is specific since lubridate::date() is different from
base::as.Date()
+ # since as.Date returns the UTC date and date() doesn't
+ test_df <- tibble(
+ a = as.POSIXct(c("2012-03-26 23:12:13", NA), tz = "America/New_York"))
+
+ # we can't (for now) use namespacing, so we need to make sure
lubridate::date()
+ # and not base::date() is being used
+ # TODO remove once https://issues.apache.org/jira/browse/ARROW-14575 is done
+ date <- lubridate::date
+ compare_dplyr_binding(
+ .input %>%
+ mutate(a_date = date(a)) %>%
+ collect(),
+ test_df
+ )
+
+ compare_dplyr_binding(
+ .input %>%
+ mutate(a_date_base = as.Date(a)) %>%
+ collect(),
+ test_df
+ )
+
+ r_date_object <- lubridate::ymd_hms("2012-03-26 23:12:13")
+ compare_dplyr_binding(
+ .input %>%
+ mutate(b = date(r_date_object)) %>%
+ collect(),
+ test_df
+ )
+
+ compare_dplyr_binding(
+ .input %>%
+ mutate(b_base = as.Date(r_date_object)) %>%
+ collect(),
+ test_df
+ )
+
+ skip("All these will fail as we're not actually forcing evaluation")
+ # a timestamp is cast correctly to date
+ expect_equal(
+ call_binding("date", Array$create(as.POSIXct("2022-02-21"))),
+ Array$create(as.POSIXct("2022-02-21"), type = date32())
+ )
+
+ # date() supports R objects
+ expect_equal(
+ call_binding("date", as.POSIXct("2022-02-21")),
+ Array$create(as.POSIXct("2022-02-21"), type = date32())
+ )
+})
+
+test_that("date() errors with unsupported inputs", {
+ skip("All these will fail as we're not actually forcing evaluation")
+ expect_error(
+ call_binding("date", Scalar$create("a string")),
+ "NotImplemented: Unsupported cast from string to date32 using function
cast_date32"
+ )
+
+ expect_error(
+ call_binding("date", Scalar$create(32.2)),
+ "NotImplemented: Unsupported cast from double to date32 using function
cast_date32"
+ )
+
+ expect_error(
+ arrow_eval(call_binding("date", Scalar$create(TRUE)), mask =
arrow_mask(list())),
+ "NotImplemented: Unsupported cast from bool to date32 using function
cast_date32"
+ )
+
+ # if we are aiming for equivalent behaviour to lubridate this should fail,
but
+ # it doesn't as it is supported in arrow, where integer casting to date
returns
+ # the date x days away from epoch
+ skip("supported in arrow, but not in lubridate")
+ expect_error(
+ call_binding("date", Scalar$create(32L)),
+ "NotImplemented: Unsupported cast from integer to date32 using function
cast_date32"
+ )
Review comment:
I would say that we should test this, even if we're diverging from
dplyr. If it's easier to do this in a dplyr pipeline compared with evaluating
the binding directly, feel free to do that.
Interestingly(?) `lubridate::date()` doesn't outright reject numerics
(though one needs to supply an origin in R, which does error). I wonder if
there is an issue | if we should raise one with lubridate to either support
that, or error with something more helpful. Though if lubridate did support
this via POSIXlt, instead of being one day after the epoch, it is actually one
second after the epoch(!):
```
> as.POSIXlt(1, origin = as.Date("1970-01-01"), tz = "UTC")
[1] "1970-01-01 00:00:01 UTC"
```
Anyway TL;DR, having arrow support `date(1)` being one day after the epoch I
think is totally fine, but we should make a test for that here that confirms
that we expect that behavior.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]