jonkeane commented on a change in pull request #12433:
URL: https://github.com/apache/arrow/pull/12433#discussion_r817932243



##########
File path: r/tests/testthat/test-dplyr-funcs-type.R
##########
@@ -768,3 +769,138 @@ test_that("nested structs can be created from scalars and existing data frames",
     tibble(a = 1:2)
   )
 })
+
+test_that("as.Date() converts successfully from date, timestamp, integer, char and double", {
+  test_df <- tibble::tibble(
+    posixct_var = as.POSIXct("2022-02-25 00:00:01", tz = "Europe/London"),
+    date_var = as.Date("2022-02-25"),
+    character_ymd_var = "2022-02-25 00:00:01",
+    character_ydm_var = "2022/25/02 00:00:01",
+    integer_var = 32L,
+    double_var = 34.56
+  )
+
+  # casting from POSIXct treated separately so we can skip on Windows
+  # TODO move the test for casting from POSIXct below once ARROW-13168 is done
+  compare_dplyr_binding(
+    .input %>%
+      mutate(
+        date_dv = as.Date(date_var),
+        date_char_ymd = as.Date(character_ymd_var, format = "%Y-%m-%d %H:%M:%S"),
+        date_char_ydm = as.Date(character_ydm_var, format = "%Y/%d/%m %H:%M:%S"),
+        date_int = as.Date(integer_var, origin = "1970-01-01")
+      ) %>%
+      collect(),
+    test_df
+  )
+
+  # the way we go about it is a bit different, but the result is the same =>
+  # testing without compare_dplyr_binding()
+  expect_equal(
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  expect_equal(
+    test_df %>%
+      record_batch() %>%
+      mutate(date_double = as.Date(double_var)) %>%
+      collect(),
+    test_df %>%
+      arrow_table() %>%
+      mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+      collect()
+  )
+
+  # actual and expected differ due to doubles are accounted for (floored in
+  # arrow and rounded to the next decimal in R)
+  expect_error(
+    compare_dplyr_binding(
+      .input %>%
+        mutate(date_double = as.Date(double_var, origin = "1970-01-01")) %>%
+        collect(),
+      test_df
+    )
+  )

Review comment:
       Thanks for the explanation. Is this part of the comment still accurate 
then: `(floored in arrow and rounded to the next decimal in R)`? 
   
   I suspect (but don't know for certain!) what's going on is that you're 
running into how R stores dates and how that differs from 
[`date32()`](https://arrow.apache.org/docs/cpp/api/datatype.html?highlight=date32#classarrow_1_1_date32_type)
 in Arrow. In R, a date object can be a float (I haven't looked at the source 
to see if it's _always_ stored as a float, but that would be interesting to 
know!) and that number is the number of days since the epoch [1]. So in R you
can have fractional days:
   
   ```
   > as.Date(36.54, origin = "1970-01-01")
   [1] "1970-02-06"
   > as.numeric(as.Date(36.54, origin = "1970-01-01"))
   [1] 36.54
   ```
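   
   On the "always a float" question, a quick base R check (a sketch, not a reading of the R source) suggests the answer is no: a date parsed from a string comes back as a double, but nothing stops a Date from sitting on top of an integer. The `structure()` call is only an illustration of that, not something the PR does:
   
   ```r
   # A Date parsed from a string is stored as a double (days since 1970-01-01)
   d_chr <- as.Date("2022-02-25")
   typeof(d_chr)
   unclass(d_chr)
   
   # A Date can also be built by hand on top of an integer vector
   d_int <- structure(19048L, class = "Date")
   typeof(d_int)
   format(d_int)
   ```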
   
   So if you add a small amount (but enough to get to the next whole number),
you'll see a new date:
   
   ```
   > as.Date(36.54, origin = "1970-01-01") + 0.46
   [1] "1970-02-07"
   ```
   
   But if we actually floored here, we would get the integer, and adding the 
same amount won't get you to the next day (just to a bit before noon here):
   
   ```
   > as.numeric(as.Date(floor(36.54), origin = "1970-01-01"))
   [1] 36
   > as.Date(floor(36.54), origin = "1970-01-01") + 0.46
   [1] "1970-02-06"
   ```
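   
   Applied to the `double_var` value from the test (34.56), that would explain the mismatch. This is a sketch under an assumption: Arrow's result is simulated with `floor()` in base R rather than run through arrow. Both dates print the same, but they don't compare equal:
   
   ```r
   # R keeps the fractional part of the day count
   r_date <- as.Date(34.56, origin = "1970-01-01")
   # Stand-in for a floored result (assumption: simulated, not produced by arrow)
   floored <- as.Date(floor(34.56), origin = "1970-01-01")
   
   # Printed form is identical, the underlying numbers are not
   same_print <- identical(format(r_date), format(floored))
   same_value <- r_date == floored
   gap <- as.numeric(r_date) - as.numeric(floored)
   ```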
   
   Soooo, this means for us that we need to choose from (in order of best to 
worst IMO, but all would be fine I think):
   
   * Store a more precise value (e.g. as 
[`date64()`](https://arrow.apache.org/docs/cpp/api/datatype.html?highlight=date32#classarrow_1_1_date64_type)),
 though we can't simply `cast(x, date64())` because `date64()` stores 
milliseconds since the epoch. We also might still have some complications 
with comparison: I haven't experimented with `date64()` objects getting pulled 
back into R, or checked whether they come in as Dates backed by floats.
   * Not accept non-integers at all (raise an error) and make a Jira to clean 
this up later.
   * Accept that Arrow simply floors, and that the actual numeric values are 
different.
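   
   For the first option, the conversion itself is only arithmetic. A rough sketch in base R, assuming we would scale fractional days to the milliseconds that `date64()` stores (no arrow calls here; `ms_per_day` and the other names are made up for illustration):
   
   ```r
   ms_per_day <- 24 * 60 * 60 * 1000    # milliseconds in one day
   
   days <- 34.56                         # double_var from the test
   ms_since_epoch <- days * ms_per_day   # what a millisecond-based type would hold
   
   # Converting back to days keeps the fraction instead of flooring it
   days_again <- ms_since_epoch / ms_per_day
   ```
   
   Whether a value like that comes back into R as a float-backed Date is the open question in the first bullet.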
   
   
   
   [1] And it actually is the epoch: a value given with a different origin is converted to days since 1970-01-01:
   
   ```
   > as.numeric(as.Date(36.54, origin = "1999-12-31"))
   [1] 10992.54
   ```



