[ 
https://issues.apache.org/jira/browse/ARROW-17905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611928#comment-17611928
 ] 

Neal Richardson commented on ARROW-17905:
-----------------------------------------

You're right. I saw that the column was a datetime in R (Sys.time()) but since 
it is used in partitioning, it is stringified in the directory name. 

> [R] as_date and similar methods fail with digit seconds
> -------------------------------------------------------
>
>                 Key: ARROW-17905
>                 URL: https://issues.apache.org/jira/browse/ARROW-17905
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Major
>
> The Arrow 9.0 R client introduced support for dates with lubridate (and base R 
> as.Date()) functions, which is awesome. 
> However, these functions fail on timestamps with fractional (decimal) 
> seconds.  This will especially confuse R users because the native R functions 
> work as expected, and R users may not realize that arrow translates these 
> calls into its own expressions.  Easiest to see this in a minimal reprex:
> {code:java}
> library(arrow); library(lubridate); library(dplyr){code}
> {code:java}
> f <- tempfile()
> data.frame(t = Sys.time(), A = 1) |>
>   write_dataset(f, partitioning = "t")
> # ERRORS
> open_dataset(f) |> mutate(as_date(t)) |> collect() {code}
> This errors with message:
> {code:java}
> open_dataset(f) |> mutate(as_date(t)) |> collect()
> Error in `collect()`:
> ! Invalid: Failed to parse string: '2022-09-30 22:03:32.123248' as a scalar 
> of type timestamp[s] {code}
> Which is strange because lubridate::as_date('2022-09-30 22:03:32.123248') 
> works fine.  
> It's easy to see the cause of the error prior to collect:
> {code:java}
> as_date(t): date32[day] (cast(strptime(t, {format="%Y-%m-%d", unit=SECOND, 
> error_is_null=false}), {to_type=date32[day], allow_int_overflow=false, 
> allow_time_truncate=false, allow_time_overflow=false, 
> allow_decimal_truncate=false, allow_float_truncate=false, 
> allow_invalid_utf8=false})){code}
> We can see a lot of assumptions there about units of parsing, but afaik from 
> R we have no way to control them.  The issue is particularly ironic because 
> as you see in my example, the column has only become a string because we used 
> it as a partition.  So arrow coerced the timestamp to a string originally 
> (using microsecond precision – which is an understandable choice because it 
> is loss-less, though it is different from R's as.character() behavior).  But 
> ironically, now arrow doesn't understand how to reverse its own 
> timestamp->string behavior to get back to a timestamp!  
> Ideally the user would have more control of these, and the default 
> assumptions would be consistent.  Ideally, as_datetime, as_date, etc. should 
> not choke regardless of the precision of the seconds, matching the existing 
> behavior of the base R (as.Date etc.) and lubridate functions. 

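A possible workaround, sketched here as an assumption rather than a confirmed fix: if the dataset is small enough to pull into memory, collecting first moves the conversion out of arrow's strptime translation and into plain R, where the stringified partition value parses without error. The column name `d` is introduced only for illustration.

```r
library(arrow)
library(dplyr)

# Recreate the dataset from the report; `t` ends up as a string partition key.
f <- tempfile()
data.frame(t = Sys.time(), A = 1) |>
  write_dataset(f, partitioning = "t")

# Collect first, so the date conversion runs in R and not in arrow.
# as.character() is a guard in case `t` does not come back as a string.
res <- open_dataset(f) |>
  collect() |>
  mutate(d = as.Date(as.character(t)))
```

This avoids the error but gives up pushing the conversion down into arrow, so it is not a substitute for fixing the translation itself.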

