[ https://issues.apache.org/jira/browse/ARROW-18242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629023#comment-17629023 ]
Nicola Crane commented on ARROW-18242:
--------------------------------------
OK, my best guess as to what is going on here is that the original lubridate
implementation uses a custom C parser to process these datetimes, whereas the
Arrow implementation hands some of this work off to whichever external library
it depends on for datetime parsing, which would explain the difference between
Windows and Linux. We might be able to add some additional pre-processing
steps to our bindings (or to the regex that their setup code produces) to
prevent this.
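
To make the pre-processing idea concrete, here is a minimal base R sketch of
the kind of check that could run before a string reaches the datetime parser.
It is not the actual binding code: the helper name validate_dmy_string is
hypothetical, it assumes the compact 8-digit ddmmyyyy form from the report
(the real dmy accepts many more layouts), and it uses base as.Date() as a
stand-in for the parser.

```r
# Hypothetical helper, not part of arrow or lubridate: blank out strings
# whose day or month component cannot belong to a real date, so the
# downstream parser never sees them.
validate_dmy_string <- function(x) {
  # Assumption: input is the compact ddmmyyyy form, exactly 8 digits.
  well_formed <- grepl("^[0-9]{8}$", x)
  day <- suppressWarnings(as.integer(substr(x, 1, 2)))
  month <- suppressWarnings(as.integer(substr(x, 3, 4)))
  in_range <- !is.na(day) & !is.na(month) &
    day >= 1L & day <= 31L &
    month >= 1L & month <= 12L
  # Anything that fails the checks becomes NA before parsing.
  ifelse(well_formed & in_range, x, NA_character_)
}

x <- c("00001976", "30111976", "01011976")
cleaned <- validate_dmy_string(x)

# Base R parser used here only as a stand-in for the Arrow kernel.
parsed <- as.Date(cleaned, format = "%d%m%Y")
print(parsed)
```

With a check like this in front of the parser, the first element is NA
regardless of how the underlying library treats a day or month of zero. A
range check of this kind does not catch every impossible date (for example,
day 31 in a 30-day month), so it would complement the parser's own
validation rather than replace it.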
> [R] arrow implementation of lubridate::dmy parses invalid date "00001976" as
> date
> ---------------------------------------------------------------------------------
>
> Key: ARROW-18242
> URL: https://issues.apache.org/jira/browse/ARROW-18242
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Lucas Mation
> Priority: Critical
>
> Sorry for so many issues, but I think this is another bug.
> The arrow implementation of `lubridate::dmy` behaves incorrectly:
> an invalid date such as '00001976' is parsed as a valid (and completely
> unrelated) date.
> library(arrow); library(dplyr); library(lubridate); library(data.table)
> # In R
> '00001976' %>% dmy
> [1] NA
> Warning message:
> All formats failed to parse. No formats found.
> #In arrow
> q <- data.table(x=c('00001976','30111976','01011976'))
> q %>% write_dataset('q')
> q2 <- 'q' %>% open_dataset %>% mutate(x2=dmy(x)) %>% collect
> q2
>           x         x2
> 1: 00001976 1975-11-30
> 2: 30111976 1976-11-30
> 3: 01011976 1976-01-01
> # Notice that '00001976' is an invalid date. The first row of x2 should be NA!
>