[
https://issues.apache.org/jira/browse/ARROW-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17560679#comment-17560679
]
Jonathan Keane edited comment on ARROW-15805 at 6/29/22 11:30 PM:
------------------------------------------------------------------
This is alluded to in the PR comments, but taking a step back and thinking
about the behavior:
{code}
dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02",
"2022-01-01", "2022-01-01")
dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02",
"2022-01-01", "2022-01-01")
as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-01-01" NA NA NA "2022-01-01"
#> [6] "2022-01-01"
as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-02-02" NA "2022-02-02" "2022-02-02" NA
#> [6] NA
{code}
Which format is chosen depends on the underlying data and, critically, on the
order that data is in: {{as.Date()}} picks the first entry in {{tryFormats}}
that parses the first non-NA element, then applies that one format to the whole
vector. Given that we can't always guarantee the order of the data we are
processing[1], we should not attempt to implement this behavior right now.
Instead, we should raise an error if someone specifies {{tryFormats}},
suggesting that they either use {{lubridate::as_date()}} if they want to
specify multiple formats (and can accept that values matching a format other
than the first one that matches will not come back as NA), or pick the single
format they want and use that.
[1] and even if we could, it would take some tricky expression writing to pick
the right format
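A rough sketch of the error-on-{{tryFormats}} idea in plain R, to make the
proposal concrete. The function name {{as_date_single_format}}, the default
format, and the message wording are all hypothetical; this is not the arrow
binding, only an illustration of rejecting {{tryFormats}} up front and applying
one explicit format instead.
{code}
# Hypothetical helper, NOT the actual arrow binding: refuse tryFormats and
# point the caller at the two alternatives discussed above.
as_date_single_format <- function(x, format = NULL, tryFormats = NULL) {
  if (!is.null(tryFormats)) {
    stop(
      "`tryFormats` is not supported here. Pass a single `format`, ",
      "or consider `lubridate::as_date()` for multiple formats.",
      call. = FALSE
    )
  }
  if (is.null(format)) {
    format <- "%Y-%m-%d"  # assumed default, for illustration only
  }
  # One format for every element, so the result does not depend on row order.
  as.Date(x, format = format)
}

as_date_single_format(c("2022-01-01", "2022/02/02"), format = "%Y-%m-%d")
#> [1] "2022-01-01" NA
{code}
With a single explicit format, an element either parses or becomes NA no matter
where it sits in the vector, which is the order independence we need.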
> [R] Update the as.Date() binding
> --------------------------------
>
> Key: ARROW-15805
> URL: https://issues.apache.org/jira/browse/ARROW-15805
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Dragoș Moldovan-Grünfeld
> Priority: Major
> Fix For: 9.0.0
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)