[ https://issues.apache.org/jira/browse/ARROW-17905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611796#comment-17611796 ]
Carl Boettiger commented on ARROW-17905:
----------------------------------------
Very cool. I actually tried to do that but couldn't figure out the syntax for
dplyr. It looks like I should even be able to pass dplyr the literal cast()
command and modify the units or truncation options, but I couldn't get that
to work, nor any of the functions listed in `list_compute_functions()`. For
example, I tried:
{code:java}
open_dataset(f) |> mutate(t = arrow_ascii_ltrim(t,10)) |> collect()
{code}
I did get it to work using substr(),
{code:java}
open_dataset(f) |> mutate(t = substr(t,1,10)) |> collect() {code}
which kinda surprised me because substr() wasn't listed in
list_compute_functions(), and most other base R or tidyverse functions that
trim strings failed (e.g. strtrim() isn't recognized, nor is
stringr::str_trim()).
Is there a list of the R functions, like substr() and as_date(), that arrow
understands?
(It would also be great to have more examples of using the compute functions
with dplyr.)
Anyway thanks! I'll keep an eye on the issue you mentioned. We can close this
one out.
> [R] as_date and similar methods fail with digit seconds
> -------------------------------------------------------
>
> Key: ARROW-17905
> URL: https://issues.apache.org/jira/browse/ARROW-17905
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: Carl Boettiger
> Priority: Major
>
> Arrow 9.0 R client introduced support for dates with lubridate (and base R
> as.Date()) functions, which is awesome.
> However, these functions fail to handle date-times with fractional seconds.
> This will especially confuse R users because the native R functions work as
> expected, and R users will not be aware of the metaprogramming translation
> behind the scenes. Easiest to see this in a minimal reprex:
> {code:java}
> library(arrow); library(lubridate); library(dplyr){code}
> {code:java}
> f <- tempfile()
> data.frame(t = Sys.time(), A = 1) |>
> write_dataset(f, partitioning = "t")
> # ERRORS
> open_dataset(f) |> mutate(as_date(t)) |> collect() {code}
> This errors with message:
> {code:java}
> open_dataset(f) |> mutate(as_date(t)) |> collect()
> Error in `collect()`:
> ! Invalid: Failed to parse string: '2022-09-30 22:03:32.123248' as a scalar
> of type timestamp[s] {code}
> Which is strange because lubridate::as_date('2022-09-30 22:03:32.123248')
> works fine.
> It's easy to see the cause of the error prior to collect:
> {code:java}
> as_date(t): date32[day] (cast(strptime(t, {format="%Y-%m-%d", unit=SECOND,
> error_is_null=false}), {to_type=date32[day], allow_int_overflow=false,
> allow_time_truncate=false, allow_time_overflow=false,
> allow_decimal_truncate=false, allow_float_truncate=false,
> allow_invalid_utf8=false})){code}
> We can see a lot of assumptions there about units of parsing, but afaik from
> R we have no way to control them. The issue is particularly ironic because,
> as you see in my example, the column has only become a string because we used
> it as a partition. So arrow coerced the timestamp to a string originally
> (using microsecond precision, which is an understandable choice because it
> is lossless, though it differs from R's as.character() behavior). But
> now arrow doesn't understand how to reverse its own timestamp->string
> behavior to get back to a timestamp!
> Ideally the user would have more control over these, and the default
> assumptions would be consistent. Ideally, as_datetime(), as_date(), etc. should
> not choke regardless of the precision of the seconds, matching the existing
> behavior of the base R (as.Date() etc.) and lubridate functions.