[ 
https://issues.apache.org/jira/browse/ARROW-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481056#comment-17481056
 ] 

Dragoș Moldovan-Grünfeld edited comment on ARROW-14442 at 1/24/22, 1:22 PM:
----------------------------------------------------------------------------

If I understand correctly how timestamps (with a missing tz) work in R and how 
they are converted to arrow, it is not enough to store the integer value R 
passes to us together with the local timezone, because that timezone is not 
used during the conversion - it is mostly metadata.

Therefore, "1970-01-01" in "BST" will always be incorrect by an hour (BST is 
UTC+0100). I think we need to account for the offset too. Without correcting 
for the offset, we have the correct timezone, but the wrong time. See below.
{code:r}
> a <- as.POSIXct("1970-01-01")
# the print method adds local tz when it is unspecified
> a 
[1] "1970-01-01 BST"

> attributes(a)
$class
[1] "POSIXct" "POSIXt" 

$tzone
[1] ""

> attr(a, "tzone") <- Sys.timezone()
> attributes(a)
$class
[1] "POSIXct" "POSIXt" 

$tzone
[1] "Europe/London"
# print result looks the same as with an unspecified `tzone` attribute
> a
[1] "1970-01-01 BST"

# yet this is not enough for conversion to arrow, which makes no use of the 
# tzone attribute and converts the equivalent UTC time, but with the desired 
# timezone and, thus, introduces a "mistake".
> Array$create(a)
Array
<timestamp[us, tz=Europe/London]>
[
  1969-12-31 23:00:00.000000
]
{code}
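
A minimal sketch of the offset I have in mind, in base R only. {{utc_offset_secs()}} is a hypothetical helper written for this comment, not an arrow function, and the timezone is passed explicitly so the result does not depend on the session.
{code:r}
# Hypothetical helper (not part of arrow): UTC offset, in seconds, of the
# instants in `x` when viewed in timezone `tz`.
utc_offset_secs <- function(x, tz) {
  # Wall-clock text of `x` in `tz`, e.g. "1970-01-01 00:00:00"
  wall_clock <- format(x, format = "%Y-%m-%d %H:%M:%S", tz = tz)
  # Re-read that wall clock as if it were UTC; the difference is the offset
  as.numeric(as.POSIXct(wall_clock, tz = "UTC")) - as.numeric(x)
}

# Explicit tz so the value does not depend on Sys.timezone()
x <- as.POSIXct("1970-01-01", tz = "Europe/London")

as.numeric(x)                        # -3600: seconds since the UTC epoch
utc_offset_secs(x, "Europe/London")  # 3600: BST is UTC+0100

# Local wall clock counted as if it were UTC
as.numeric(x) + utc_offset_secs(x, "Europe/London")  # 0
{code}
The stored value plus the offset gives the local wall-clock time; whether and where arrow should apply such a correction is the open question here.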



> [R] Should we warn when converting timestamps with "" as tzone?
> ---------------------------------------------------------------
>
>                 Key: ARROW-14442
>                 URL: https://issues.apache.org/jira/browse/ARROW-14442
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Jonathan Keane
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>
> From the comments, we've decided to go with option 3:
> * Set the timezone to local time without changing the integer value of the 
> timestamp. We store whatever integer R passes to us (21600), with CST as the 
> timezone set. Display is then "1970-01-01 00:00:00 CST".
> This is surprising because we are asserting the local timezone when that is 
> not specified in R.
> ============================================
> {{POSIXct}} in R can have timezones specified as {{""}} which is typically 
> interpreted as the session local timezone. 
> This can lead to surprising results like:
> {code:r}
> > Sys.timezone()
> [1] "America/Chicago"
> > as.integer(as.POSIXct("1970-01-01"))
> [1] 21600
> > Sys.setenv(TZ = "UTC")
> > as.integer(as.POSIXct("1970-01-01"))
> [1] 0
> > Sys.setenv(TZ = "Australia/Brisbane")
> > as.integer(as.POSIXct("1970-01-01"))
> [1] -36000
> {code}
> See also: 
> https://stackoverflow.com/questions/69670142/how-can-i-store-timezone-agnostic-dates-for-sharing-between-r-and-python-using-p/69678923#69678923
>  
> This runs counter to what timestamps without timezones are interpreted as in 
> Arrow: 
> https://github.com/apache/arrow/blob/03669438bbce53078616c7f943a63fb0c11db196/format/Schema.fbs#L333-L336
> > However, it may also be encoded into a Timestamp column with an empty 
> > timezone. The timestamp values should be computed "as if" the timezone of 
> > the date-time values was UTC; for example, the naive date-time "January 1st 
> > 1970, 00h00" would be encoded as timestamp value 0.
> Critically in R, when {{as.POSIXct("1970-01-01 00:00:00")}} is run, the 
> timestamp value is computed "as if" the timezone of the date-time values was 
> the local timezone (and *not* UTC like the Arrow spec says).
> This can lead to some surprising results when converting these timezoneless 
> timestamps from R to Arrow. Using {{as.POSIXct("1970-01-01 00:00:00")}} as an 
> example, and presuming US Central time, we have a few options:
> * Warn when the timezone is "" or not set that the behavior might be 
> surprising.
>   We store whatever integer R passes to us (21600), with no timezone set. 
> When someone sees this formatted, the times/dates will be what the time was 
> at UTC ("1970-01-01 06:00:00")
> * Set the timezone to UTC without changing the integer value of the 
> timestamp.   We store whatever integer R passes to us (21600), with UTC as 
> the timezone set. When someone sees this formatted, the times/dates will be 
> in UTC ("1970-01-01 06:00:00 UTC"). This might be surprising / 
> counterintuitive because the timestamps will suddenly be different and will 
> be based in UTC and not local time like people are expecting.
> * Set the timezone to local time without changing the integer value of the 
> timestamp. We store whatever integer R passes to us (21600), with CST as the 
> timezone set. Display is then "1970-01-01 00:00:00 CST".
> This is surprising because we are asserting the local timezone when that is 
> not specified in R.
> If someone is using a timestamp without tzone in R to represent a 
> timezoneless timestamp, options 2 and 3 above violate that when it is put 
> into Arrow. Whereas, if someone is using a timestamp that just so happens to 
> be without a tzone but they assume it's in local time, option 1 leads to 
> (very) surprising results.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
