jonkeane commented on a change in pull request #12240:
URL: https://github.com/apache/arrow/pull/12240#discussion_r791019809
##########
File path: r/tests/testthat/test-Array.R
##########
@@ -985,3 +985,14 @@ test_that("Array to C-interface", {
delete_arrow_schema(schema_ptr)
delete_arrow_array(array_ptr)
})
+
+test_that("Array coverts timestamps with missing timezone /assumed local tz
correctly", {
+ withr::with_envvar(c(TZ = "America/Chicago"), {
Review comment:
We want this to actually be unset though, right? Similar to
https://github.com/apache/arrow/blob/daa5c18e9697a6455a7a75fec19594543c17b21e/r/tests/testthat/test-Array.R#L264-L267
to simulate the circumstance where TZ is unset (though we might want to use
`TZ = NA` instead of `TZ = ""` there since `NA` _unsets_ the variable instead
of simply setting it to `""`)
``` r
Sys.getenv("TZ")
#> [1] ""
timestamp_r <- as.POSIXct("2018-10-07 19:04:05")
timestamp_r
#> [1] "2018-10-07 19:04:05 CDT"
attributes(timestamp_r)
#> $class
#> [1] "POSIXct" "POSIXt"
#>
#> $tzone
#> [1] ""
as.integer(timestamp_r)
#> [1] 1538957045
Sys.setenv("TZ" = "Australia/Brisbane")
timestamp_r <- as.POSIXct("2018-10-07 19:04:05")
timestamp_r
#> [1] "2018-10-07 19:04:05 AEST"
attributes(timestamp_r)
#> $class
#> [1] "POSIXct" "POSIXt"
#>
#> $tzone
#> [1] ""
as.integer(timestamp_r)
#> [1] 1538903045
Sys.unsetenv("TZ")
timestamp_r <- as.POSIXct("2018-10-07 19:04:05")
timestamp_r
#> [1] "2018-10-07 19:04:05 CDT"
attributes(timestamp_r)
#> $class
#> [1] "POSIXct" "POSIXt"
#>
#> $tzone
#> [1] ""
as.integer(timestamp_r)
#> [1] 1538957045
```
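A minimal sketch of the difference between unsetting `TZ` and setting it to an empty string, assuming the withr package is installed. The variable names `tz_when_na` and `tz_when_empty` are illustrative only and are not part of the PR.

```r
library(withr)

# NA unsets the variable for the duration of the block,
# so Sys.getenv() falls back to the `unset` value
tz_when_na <- with_envvar(c(TZ = NA), Sys.getenv("TZ", unset = NA))

# "" asks for an empty value instead; how an empty value is handled
# may differ between platforms, so no output is shown for it here
tz_when_empty <- with_envvar(c(TZ = ""), Sys.getenv("TZ", unset = NA))

is.na(tz_when_na)
#> [1] TRUE
tz_when_empty
```

Passing `unset = NA` to `Sys.getenv()` is what makes the two cases distinguishable; with the default `unset = ""` both would look the same.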
##########
File path: r/tests/testthat/test-Array.R
##########
@@ -985,3 +985,14 @@ test_that("Array to C-interface", {
delete_arrow_schema(schema_ptr)
delete_arrow_array(array_ptr)
})
+
+test_that("Array coverts timestamps with missing timezone /assumed local tz
correctly", {
+ withr::with_envvar(c(TZ = "America/Chicago"), {
+ a <- as.POSIXct("1970-01-01 00:00:00")
+ attr(a, "tzone") <- Sys.getenv("TZ")
Review comment:
This is a good first step, but would it be better to have two timestamps
here? One created with `TZ` unset, and one where we explicitly set the
timezone with `attr(b, "tzone") <- Sys.timezone()`, and then confirm that
those two arrays are equal?
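A rough base-R sketch of the two timestamps described above. It only checks the underlying seconds since the epoch; the Arrow-side comparison (for example between `Array$create(a)` and `Array$create(b)`) is what the test itself would assert and is not shown here.

```r
# Timestamp created without an explicit timezone: the string is interpreted
# in the session's local timezone
a <- as.POSIXct("1970-01-01 00:00:00")

# Same instant, but with the timezone attached explicitly.
# Sys.timezone() may return NA on systems where it cannot be determined.
b <- a
attr(b, "tzone") <- Sys.timezone()

# Changing the tzone attribute does not change the stored value,
# so the seconds since the epoch match
identical(as.numeric(a), as.numeric(b))
#> [1] TRUE
```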
##########
File path: r/tests/testthat/test-Array.R
##########
@@ -985,3 +985,14 @@ test_that("Array to C-interface", {
delete_arrow_schema(schema_ptr)
delete_arrow_array(array_ptr)
})
+
+test_that("Array coverts timestamps with missing timezone /assumed local tz
correctly", {
+ withr::with_envvar(c(TZ = "America/Chicago"), {
+ a <- as.POSIXct("1970-01-01 00:00:00")
+ attr(a, "tzone") <- Sys.getenv("TZ")
Review comment:
> But wouldn't those arrays be equal in their absolute value (without
> the "tzone" metadata)
This is what we expect + want, no? R creates a POSIXct by taking the
datetime string you have and converting it to the number of seconds from the
epoch based on the time string being in the local timezone of the session
(unless you proactively provide a different one). This is what I mean when I
say that for R the timezoneless timestamps are *not* naive, they are really
timestamps at a specific timezone, R just happens to spell that timezone
confusingly as `""` sometimes.
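To make the conversion concrete, here is a small sketch that parses the same wall-clock string under three explicitly named timezones (chosen for illustration) and looks at the resulting seconds since the epoch:

```r
# The same wall-clock string parsed in three different timezones
utc      <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
chicago  <- as.POSIXct("1970-01-01 00:00:00", tz = "America/Chicago")
brisbane <- as.POSIXct("1970-01-01 00:00:00", tz = "Australia/Brisbane")

as.numeric(utc)
#> [1] 0
as.numeric(chicago)   # midnight CST is 06:00 UTC
#> [1] 21600
as.numeric(brisbane)  # midnight AEST is 14:00 UTC on the previous day
#> [1] -36000
```

Leaving `tz` out behaves like one of these cases, with the session's local timezone filling in for the explicit name.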
##########
File path: r/tests/testthat/test-Array.R
##########
@@ -985,3 +985,14 @@ test_that("Array to C-interface", {
delete_arrow_schema(schema_ptr)
delete_arrow_array(array_ptr)
})
+
+test_that("Array coverts timestamps with missing timezone /assumed local tz
correctly", {
+ withr::with_envvar(c(TZ = "America/Chicago"), {
+ a <- as.POSIXct("1970-01-01 00:00:00")
+ attr(a, "tzone") <- Sys.getenv("TZ")
Review comment:
The phrase "the display" here is confusing, and wrong in some
circumstances. When printing arrays, AFAICT arrow currently prints
timestamps in UTC regardless of whether a timezone is attached:
``` r
library(arrow, warn.conflicts = FALSE)
# specifically setting the timezone, and the Arrow Array repl shows UTC
ts <- as.POSIXct("2020-01-01 02:00:00", tz = "America/Chicago") + 1:10*3600
ts
#> [1] "2020-01-01 03:00:00 CST" "2020-01-01 04:00:00 CST"
#> [3] "2020-01-01 05:00:00 CST" "2020-01-01 06:00:00 CST"
#> [5] "2020-01-01 07:00:00 CST" "2020-01-01 08:00:00 CST"
#> [7] "2020-01-01 09:00:00 CST" "2020-01-01 10:00:00 CST"
#> [9] "2020-01-01 11:00:00 CST" "2020-01-01 12:00:00 CST"
attr(ts, "tzone")
#> [1] "America/Chicago"
arr <- Array$create(ts)
arr
#> Array
#> <timestamp[us, tz=America/Chicago]>
#> [
#> 2020-01-01 09:00:00.000000,
#> 2020-01-01 10:00:00.000000,
#> 2020-01-01 11:00:00.000000,
#> 2020-01-01 12:00:00.000000,
#> 2020-01-01 13:00:00.000000,
#> 2020-01-01 14:00:00.000000,
#> 2020-01-01 15:00:00.000000,
#> 2020-01-01 16:00:00.000000,
#> 2020-01-01 17:00:00.000000,
#> 2020-01-01 18:00:00.000000
#> ]
arr$type$timezone()
#> [1] "America/Chicago"
as.vector(arr)
#> [1] "2020-01-01 03:00:00 CST" "2020-01-01 04:00:00 CST"
#> [3] "2020-01-01 05:00:00 CST" "2020-01-01 06:00:00 CST"
#> [5] "2020-01-01 07:00:00 CST" "2020-01-01 08:00:00 CST"
#> [7] "2020-01-01 09:00:00 CST" "2020-01-01 10:00:00 CST"
#> [9] "2020-01-01 11:00:00 CST" "2020-01-01 12:00:00 CST"
attr(as.vector(arr), "tzone")
#> [1] "America/Chicago"
# without setting the timezone, and the Arrow Array repl still shows UTC
ts <- as.POSIXct("2020-01-01 02:00:00") + 1:10*3600
ts
#> [1] "2020-01-01 03:00:00 CST" "2020-01-01 04:00:00 CST"
#> [3] "2020-01-01 05:00:00 CST" "2020-01-01 06:00:00 CST"
#> [5] "2020-01-01 07:00:00 CST" "2020-01-01 08:00:00 CST"
#> [7] "2020-01-01 09:00:00 CST" "2020-01-01 10:00:00 CST"
#> [9] "2020-01-01 11:00:00 CST" "2020-01-01 12:00:00 CST"
attr(ts[[1]], "tzone")
#> NULL
arr <- Array$create(ts)
arr
#> Array
#> <timestamp[us]>
#> [
#> 2020-01-01 09:00:00.000000,
#> 2020-01-01 10:00:00.000000,
#> 2020-01-01 11:00:00.000000,
#> 2020-01-01 12:00:00.000000,
#> 2020-01-01 13:00:00.000000,
#> 2020-01-01 14:00:00.000000,
#> 2020-01-01 15:00:00.000000,
#> 2020-01-01 16:00:00.000000,
#> 2020-01-01 17:00:00.000000,
#> 2020-01-01 18:00:00.000000
#> ]
arr$type$timezone()
#> [1] ""
as.vector(arr)
#> [1] "2020-01-01 03:00:00 CST" "2020-01-01 04:00:00 CST"
#> [3] "2020-01-01 05:00:00 CST" "2020-01-01 06:00:00 CST"
#> [5] "2020-01-01 07:00:00 CST" "2020-01-01 08:00:00 CST"
#> [7] "2020-01-01 09:00:00 CST" "2020-01-01 10:00:00 CST"
#> [9] "2020-01-01 11:00:00 CST" "2020-01-01 12:00:00 CST"
attr(as.vector(arr), "tzone")
#> NULL
```
But as shown above, when pulling the data back in with
`as.vector(arr)`, the timezone is pulled in with it, so when R displays the
timestamp it is faithful to the original timestamp.
##########
File path: r/tests/testthat/test-Array.R
##########
@@ -985,3 +985,14 @@ test_that("Array to C-interface", {
delete_arrow_schema(schema_ptr)
delete_arrow_array(array_ptr)
})
+
+test_that("Array coverts timestamps with missing timezone /assumed local tz
correctly", {
+ withr::with_envvar(c(TZ = "America/Chicago"), {
+ a <- as.POSIXct("1970-01-01 00:00:00")
+ attr(a, "tzone") <- Sys.getenv("TZ")
Review comment:
Thanks for digging that up, I _assumed_ it existed already but hadn't
gone searching
##########
File path: r/tests/testthat/test-Array.R
##########
@@ -985,3 +985,14 @@ test_that("Array to C-interface", {
delete_arrow_schema(schema_ptr)
delete_arrow_array(array_ptr)
})
+
+test_that("Array coverts timestamps with missing timezone /assumed local tz
correctly", {
+ withr::with_envvar(c(TZ = "America/Chicago"), {
+ a <- as.POSIXct("1970-01-01 00:00:00")
+ attr(a, "tzone") <- Sys.getenv("TZ")
Review comment:
Yeah, let's save the display fix for ARROW-14567. I also added a comment
there, plus the R component, since it should all wire up either the same way
or very easily after that. Definitely out of scope for this ticket.
> I think we can attach the local / system timezone when it isn't passed
explicitly (and this would theoretically solve this Jira issue).
This sounds like the right approach