R-Devel,
I store and retrieve a large amount of financial data (millions of rows) in a 
PostgreSQL database keyed by date (represented in R by class Date). 
Unfortunately, I frequently find that a great deal of processing time is spent 
converting dates from character representations to class Date in R, presumably 
because strptime is slow for large vectors (>10,000 elements). I'd like to 
suggest a patch that speeds up date conversion considerably for most large 
date vectors (up to 400x in some real-life cases).

I suspect most people with large vectors of class Date will find that most of 
their values are duplicated many times over. (There are, after all, only 36,524 
days in a century.) Given this, as.Date.character can be sped up substantially 
for large vectors by calling strptime only on the unique dates and then filling 
in the calculated values for the entire vector. Since the time savings can be 
several minutes in real-life cases, I think this enhancement should certainly 
be considered. Also, in the worst case of a long vector with only one 
duplicated value, the suggested change does not measurably slow down the 
calculation.

Here's a proof of concept:
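as.Date.character2 <- function(x, ...) {
    ## anyDuplicated() returns 0 when every element of x is distinct
    if (anyDuplicated(x) > 0L) {
        ux <- unique(x)                  # distinct strings, parsed once each
        idx <- match(x, ux)              # position of each element of x in ux
        y <- as.Date.character(ux, ...)
        return(y[idx])                   # expand back to the length of x
    }
    ## No duplicates: fall through to the existing method
    as.Date.character(x, ...)
}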
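
To make the fill-in step concrete, here is a minimal sketch of the same 
unique/match idea on a toy vector. The input values are made up for 
illustration, and as.Date is used instead of as.Date.character so that the 
snippet relies only on the generic:

```r
# Toy input (illustrative values): two distinct dates plus a missing value.
x <- c("2010-01-02", "2010-01-01", "2010-01-02", NA, "2010-01-01")

ux  <- unique(x)     # the distinct strings (NA counts as one value)
idx <- match(x, ux)  # for each element of x, its position within ux
y   <- as.Date(ux)   # one parse per distinct value
res <- y[idx]        # expand back to the original length and order

# The expanded result agrees with parsing every element directly.
stopifnot(identical(res, as.Date(x)))
```

Because match also maps NA to the NA entry of ux, missing values pass through 
without special handling.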

## Example1: Construct a length-1,000,000 character vector of 1000 unique dates
## Parsing only the unique values makes the conversion >250x faster

> dtch <- format(sample(Sys.Date()-1:1000, 1e6, replace=TRUE))
> system.time(dt1 <- as.Date.character(dtch))
   user  system elapsed 
 12.630  23.628  36.262
> system.time(dt2 <- as.Date.character2(dtch))
   user  system elapsed 
  0.117   0.019   0.136 
> identical(dt1, dt2)
[1] TRUE


## Example2: In a "worst case" scenario of a 1,000,002-length character vector
## of 1,000,001 unique dates, the new function is not any slower (within error).
> dtch <- format(c(Sys.Date(), Sys.Date()+-5e5:5e5))
> system.time(dt1 <- as.Date.character(dtch))
   user  system elapsed 
 20.264  25.584  45.855
> system.time(dt2 <- as.Date.character2(dtch))
   user  system elapsed 
 20.525  24.809  45.335 
> identical(dt1, dt2)
[1] TRUE

Alternatively, this logic should be built into strptime itself.
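
As a rough illustration of what that might look like at the R level, here is a 
sketch of a wrapper around strptime. The name strptime_dedup is hypothetical 
(it is not an existing function), and the sketch assumes a single format 
string; a real change inside strptime would also have to handle a vector of 
formats:

```r
# Hypothetical wrapper, not part of base R: parse each distinct string once.
# Assumes `format` is a single string; with several formats, each element of
# x would have to be paired with its format before deduplicating.
strptime_dedup <- function(x, format, tz = "") {
    x <- as.character(x)
    ux <- unique(x)                                   # distinct inputs
    parsed <- strptime(ux, format = format, tz = tz)  # one parse per value
    parsed[match(x, ux)]                              # expand to length(x)
}

# Small made-up input with repeats and a missing value.
x <- rep(c("2010-01-02 03:04:05", "2011-06-07 08:09:10", NA), times = 3)
fmt <- "%Y-%m-%d %H:%M:%S"
res <- strptime_dedup(x, fmt, tz = "UTC")

# Same formatted values as parsing every element directly.
stopifnot(identical(format(res), format(strptime(x, fmt, tz = "UTC"))))
```

The result is still a POSIXlt object, so callers such as as.Date.character 
would not need to change.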

Robert

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
