[ 
https://issues.apache.org/jira/browse/ARROW-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-3543:
-----------------------------------
    Description: 
See below for original description and reports. In sum, there is a mismatch 
between how the C++ library and R interpret data without a timezone, and it 
turns out that we're not passing the timezone to R if it is set in Arrow C++ 
anyway. 

The C++ library docs say "If a timezone-aware field contains a recognized 
timezone, its values may be localized to that locale upon display; the values 
of timezone-naive fields must always be displayed “as is”, with no localization 
performed on them." But R's print default is the current time zone: 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html

My guess is that readr::read_delim interprets timestamps without a timezone as 
being in the current time zone, but Arrow C++ interprets them as UTC, which 
becomes a problem when R tries to print the timestamp.

I'm guessing that if you did print(df$Date, tz="GMT") it would look as you 
expect.
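
To make the suspected mechanism concrete, here is a minimal Python sketch (standard library only) of how one stored instant prints differently depending on the display zone. The helper names `to_epoch_millis` and `render` are hypothetical, and the fixed UTC-5 offset is an assumption standing in for US/Eastern; it matches the sample dates, which all fall outside daylight saving time. This is not how Arrow or R implement the conversion, only an illustration of the 5-hour shift seen in the report.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumption: a fixed UTC-5 offset stands in for US/Eastern here.
EASTERN_STANDARD = timezone(timedelta(hours=-5), "EST")


def to_epoch_millis(naive: datetime) -> int:
    """Store a naive wall-clock value as if it were UTC (hypothetical helper)."""
    return (naive.replace(tzinfo=timezone.utc) - EPOCH) // timedelta(milliseconds=1)


def render(epoch_millis: int, tz: timezone) -> str:
    """Format the stored instant in the given display zone, to milliseconds."""
    instant = EPOCH + timedelta(milliseconds=epoch_millis)
    return instant.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]


stored = to_epoch_millis(datetime(2018, 2, 1, 14, 0, 0, 531000))
shown_utc = render(stored, timezone.utc)        # display zone forced to UTC
shown_local = render(stored, EASTERN_STANDARD)  # display zone = "current" zone
print(shown_utc)    # 2018-02-01 14:00:00.531
print(shown_local)  # 2018-02-01 09:00:00.531
```

Forcing the display zone to UTC recovers the original wall-clock value, which is why printing with tz="GMT" would be expected to look right.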

Other fun fact I saw while digging in: the arrow-to-vector code for timestamp 
doesn't seem to consider time zone information if it does exist, so we should 
handle that too. 
https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L504-L514
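
As a rough illustration of what honoring the time zone during conversion could look like, here is a Python sketch. `TimestampColumn` and `to_datetimes` are hypothetical stand-ins, not the Arrow R bindings or the code at the link above. The sketch assumes values are stored as milliseconds since the epoch: timezone-naive columns are returned as is, and timezone-aware columns carry their zone through.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone, tzinfo
from typing import List, Optional

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass
class TimestampColumn:
    """Hypothetical stand-in for a timestamp array plus its type metadata."""
    epoch_millis: List[int]
    tz: Optional[tzinfo] = None  # None means the column is timezone-naive


def to_datetimes(column: TimestampColumn) -> List[datetime]:
    """Convert stored values, propagating the zone only when one is set."""
    result = []
    for millis in column.epoch_millis:
        instant = EPOCH_UTC + timedelta(milliseconds=millis)
        if column.tz is None:
            # Naive field: keep the wall-clock value "as is", attach no zone.
            result.append(instant.replace(tzinfo=None))
        else:
            # Aware field: same instant, expressed in the column's zone.
            result.append(instant.astimezone(column.tz))
    return result


# Assumption: a fixed UTC-5 offset stands in for US/Eastern in winter.
eastern = timezone(timedelta(hours=-5), "EST")
naive_values = to_datetimes(TimestampColumn([1517493600531]))
aware_values = to_datetimes(TimestampColumn([1517493600531], tz=eastern))
print(naive_values[0])  # 2018-02-01 14:00:00.531000
print(aware_values[0])  # 2018-02-01 09:00:00.531000-05:00
```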

-----
Original description:

Hello the dream team,

Pasting from [https://github.com/wesm/feather/issues/351]

Thanks for this wonderful package. I was playing with feather and some 
timestamps and I noticed some dangerous behavior. Maybe it is a bug.

Consider this

 
{code:java}
import pandas as pd
import feather
import numpy as np

# Naive timestamps whose wall-clock values are in UTC.
df = pd.DataFrame({
    'string_time_utc': [
        pd.to_datetime('2018-02-01 14:00:00.531'),
        pd.to_datetime('2018-02-01 14:01:00.456'),
        pd.to_datetime('2018-03-05 14:01:02.200'),
    ]
})
# Convert to US/Eastern wall-clock time, then drop the time zone again.
df['timestamp_est'] = (
    pd.to_datetime(df.string_time_utc)
    .dt.tz_localize('UTC')
    .dt.tz_convert('US/Eastern')
    .dt.tz_localize(None)
)
df
Out[17]:
          string_time_utc           timestamp_est
0 2018-02-01 14:00:00.531 2018-02-01 09:00:00.531
1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456
2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200
{code}
Here I create the `EST` equivalent of each of my original timestamps (which 
are in `UTC`).

Now saving the dataframe to `csv` or to `feather` will generate two completely 
different results.

 
{code:java}
df.to_csv('P://testing.csv')
df.to_feather('P://testing.feather')
{code}
Switching to R.

Using the good old `csv` gives me something a bit annoying, but expected. R 
assumes the time zone is `UTC` by default and wrongly attaches it to 
`timestamp_est`. No big deal: I can always use `with_tz` or, even better, import 
the column as character and convert it to a timestamp in R.

 
{code:java}
> dataframe <- read_csv('P://testing.csv')
Parsed with column specification:
cols(
  X1 = col_integer(),
  string_time_utc = col_datetime(format = ""),
  timestamp_est = col_datetime(format = "")
)
Warning message:
Missing column names filled in: 'X1' [1]
>
> dataframe %>% mutate(mytimezone = tz(timestamp_est))

# A tibble: 3 x 4
     X1 string_time_utc         timestamp_est           mytimezone
  <int> <dttm>                  <dttm>                  <chr>
1     0 2018-02-01 14:00:00.530 2018-02-01 09:00:00.530 UTC
2     1 2018-02-01 14:01:00.456 2018-02-01 09:01:00.456 UTC
3     2 2018-03-05 14:01:02.200 2018-03-05 09:01:02.200 UTC
{code}
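
The workaround mentioned above (read the column as text, then attach the zone explicitly) can be sketched as follows. The original suggests doing this in R; the sketch uses Python's standard library to match the earlier snippet. `parse_with_zone` is a hypothetical helper, and the fixed UTC-5 offset is an assumption standing in for US/Eastern on the sample dates.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for US/Eastern in winter.
EASTERN_STANDARD = timezone(timedelta(hours=-5), "EST")


def parse_with_zone(text: str, tz: timezone) -> datetime:
    """Parse a zone-less timestamp string and declare which zone it is in."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    return naive.replace(tzinfo=tz)  # attaches the zone, wall clock unchanged


eastern = parse_with_zone("2018-02-01 09:00:00.531", EASTERN_STANDARD)
as_utc = eastern.astimezone(timezone.utc)
print(as_utc.isoformat(timespec="milliseconds"))  # 2018-02-01T14:00:00.531+00:00
```

Because the zone is stated at parse time, nothing downstream has to guess how to interpret the wall-clock value.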
{code:java}
# Now look at what happens with feather:

> dataframe <- read_feather('P://testing.feather')
>
> dataframe %>% mutate(mytimezone = tz(timestamp_est))

# A tibble: 3 x 3
  string_time_utc         timestamp_est           mytimezone
  <dttm>                  <dttm>                  <chr>
1 2018-02-01 09:00:00.531 2018-02-01 04:00:00.531 ""
2 2018-02-01 09:01:00.456 2018-02-01 04:01:00.456 ""
3 2018-03-05 09:01:02.200 2018-03-05 04:01:02.200 ""
{code}
My timestamps have been converted!!! Pure insanity. 
Am I missing something here?

Thanks!!


> [R] Better support for timestamp format and time zones in R
> -----------------------------------------------------------
>
>                 Key: ARROW-3543
>                 URL: https://issues.apache.org/jira/browse/ARROW-3543
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Olaf
>            Priority: Major
>             Fix For: 1.0.0
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
