Re: [R] How to parse a really silly date with lubridate

avi.e.gross Wed, 13 Jul 2022 11:38:55 -0700

Eberhard,

Others have supplied ways to do this using various date management functions. 
But I want to add another option that may make sense if the dates are not all 
quite as predictable.


You can use your own regular expressions in R as in most languages, that try to 
match each entry to one or more patterns you supply. When it matches a pattern, 
you can capture various bits of what you find and recombine them into any valid 
date format you want or convert them into some date type.

You just showed two pseudo-date-time entries:

         1 9. Jul 2022 at 11:39
         2 10. Jul 2022 at 01:58

It looks like the first item in each is a sequence number that can be ignored. 
But I assume after the ninth, it takes up ever more space as it becomes 10 then 
100 and so on. It would thus be of varying width. So, in English, you are 
looking for a pattern like:

- beginning of the line/entry
- one or more digits in 0-9 followed by what looks like a space, but maybe 
arbitrary whitespace.
- capture one or two more digits if followed by a period, but not the period.
- capture what looks like a three-letter character string representing a month 
as a "word" but not what follows.
- whitespace followed by "at" followed by more whitespace can be ignored but 
must be present.
- capture a two digit number perhaps including the colon and another two-digit 
number, or two parts without the colon.
- match the end of the entry

If it looks like the above, and you captured the parts needed, you can 
reconstitute the parts, albeit perhaps first checking to see if the month names 
are valid (which can be handled by a pattern part that looks like 
"(Jan|Feb|...|Dec)" and of course making sure the other parts also fit 
requirements such as days as integers being between 1 and 31 and similar issues 
about time. My guess is that you don't really want to validate that deeply, let 
alone do the validation within the regular expression.

But when your data ends up having multiple ways of saying the date that you can 
enumerate and tell apart, this can be a way to parse them and identify what it 
matches and so on.

In the above case, full regular expressions may be overkill nd regular string 
functions might suffice.  There are many ways, including something as simple as 
this:

dt <- "1 9. Jul 2022 at 11:39"
dt_parts <- strsplit(dt, " ")

Unfortunately, this returns a vector of parts with a decimal point still in 
place:

> dt_parts
[1] "1"     "9."    "Jul"   "2022"  "at"    "11:39"

You can now ignore  the parts in the first fifth positions as the sequence 
number and "at" are meaningless and use these:

the_day <- dt_parts[2]
the_month <- dt_parts[3]
the_year <- dt_parts[4]
the_time <- dt_parts[6]

Of course, you now may want further processing using one of many ways to turn 
something like "9." (note there may be a "12." too) or even make it an integer 
instead of a character string. You may want to change the dates into numerals 
with something like this:

month_conversion <- list("Jan"="01", Feb="02", Mar="03", Apr="04", May="05", 
Jun="06", 
                         Jul="07", Aug="08", Sep="09", Oct="10", Nov="11", 
Dec="12")
month_numeric_string <- month_conversion[the_month]


My point is that if your format is consistent, you can pick out the parts you 
want as strings or make them decimals or whatever you need and use string 
functions, such as paste0() to recombine them into the format you want. 

Of course if you can tell some date-related function how to do it for you, that 
is obviously simpler and better for you.

And, obviously, it would be even nicer if you could get them to not give dates 
in this format!

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Rui Barradas
Sent: Wednesday, July 13, 2022 10:48 AM
To: Dr Eberhard Lisse <nos...@lisse.na>; r-help@r-project.org
Subject: Re: [R] How to parse a really silly date with lubridate

Hello,

With lubridate, note in what sequence your datetime elements occur and use the 
appropriate function.


d <- c('9. Jul 2022 at 11:39', '10. Jul 2022 at 01:58')
lubridate::dmy_hm(d)
#> [1] "2022-07-09 11:39:00 UTC" "2022-07-10 01:58:00 UTC"


Hope this helps,

Rui Barradas

Às 14:40 de 13/07/2022, Dr Eberhard Lisse escreveu:
> 
> Hi,
> 
> I have data file which generated by an otherwise very nice (diabetes
> log) app, but exports dates really silly.
> 
> After reading the enclosed mwe.csv into R like so
> 
>       MWE <- read_delim('mwe.csv', delim = ';') %>%
>          select(Date) %>%
>          print()
> 
> 
> this comes out as:
> 
>       # A tibble: 2 × 1
>      Date
>      <chr>
>       1 9. Jul 2022 at 11:39
>       2 10. Jul 2022 at 01:58
> 
> 
> No matter what I try I am not able to parse this inside R to get at
> proper dates (I have loaded tidyverse and lubridate).
> 
> I can easily do somethig
> 
>       csvq  -d ';' -t '%e. %b %Y at %H:%i' \
>          'SELECT Date as oridate,
>              DATETIME_FORMAT(Date, "%Y-%m-%d %H:%m") AS date
>          FROM mwe'
> 
>       +-----------------------+------------------+
>       |        oridate        |       date       |
>       +-----------------------+------------------+
>       | 9. Jul 2022 at 11:39  | 2022-07-09 11:07 |
>       | 10. Jul 2022 at 01:58 | 2022-07-10 01:07 |
>       +-----------------------+------------------+
> 
> and hence could easily do something like
> 
>       csvq  -d ';' -t '%e. %b %Y at %H:%i' \
>        'ALTER mwe
>        SET Date = DATETIME_FORMAT(Date, "%Y-%m-%d %H:%m")'
> 
> but would rather like to be able to do it inside R and would therefor
> appreciate any advice in this regard.
> 
> 
> greetings, el
>

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] How to parse a really silly date with lubridate

Reply via email to