Hi Rui, Marc, and Gabor,

Thanks for your replies to my question. All were helpful and it was interesting 
to see how different people approach various aspects of the same problem.

I spent some time this weekend looking at Rui's solution, which is certainly
much clearer than my own. I managed to figure out pretty much all the details
of how it works, and to tweak it slightly so that it does exactly what I
wanted. (See the revised code below.)

I still have a few questions, though. The first concerns the insertion of
the code "Y > 2012" to set year values beyond 2012 to NA (on line 10 of the
function below). When I add this (or use it in place of "nchar(Y) > 4"), the
code successfully finds the problem date "05/16/2015". After that, though, it
produces the following error message:

Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning: Invalid date 
values in",  :  missing value where TRUE/FALSE needed

Why is this happening? If the code correctly handles the date "06/20/1840"
without producing an error, why can't it do likewise with "05/16/2015"?

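A possible explanation, sketched under the assumption that Y is still a
character vector when line 10 runs (strsplit() returns strings): comparing a
string with a number is done as a string comparison, so "un" > 2012 is most
likely TRUE and the unknown year in "11/20/un" becomes NA. The test
Y != "un" is then NA for that row, and if no other row in the column yields
TRUE, any() returns NA, which if() rejects. The columns containing
"06/20/1840" and "05/16/2015" each hold another bad date that yields a
definite TRUE, which would explain why the error only shows up afterwards.
The vector Y below is made up for illustration.

```r
# Year strings as they come out of strsplit(): character, not numeric.
Y <- c("2011", "2015", "1840", "un")

# The number is coerced to a string, so these are string comparisons.
# "un" > "2012" is TRUE in the usual locales (letters sort after digits).
Y > 2012
Y < 1900

# Same rule as line 10 of the function: out-of-range years become NA.
Y2 <- ifelse(nchar(Y) > 4 | Y > 2012 | Y < 1900, NA, Y)

# Wherever Y2 is NA, this test is NA as well instead of TRUE or FALSE.
Y2 != "un"

# any() over values that are only FALSE or NA returns NA ...
any(c(FALSE, NA))

# ... and if (NA) stops with "missing value where TRUE/FALSE needed".
# Two possible guards:
isTRUE(any(c(FALSE, NA)))          # FALSE, safe inside if()
any(c(FALSE, NA), na.rm = TRUE)    # FALSE, NA values are ignored
```
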
The second question is why it's necessary to put "x" on line 15, after the
cat("Warning ...") call. I know that I don't get any date columns if I leave
it out, but I'm not sure why.
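
A likely reason, shown as a minimal sketch: an R function returns the value
of the last expression it evaluates. Without the trailing "x", that last
expression is the if() statement, whose value is NULL either way (cat()
returns NULL, and an if() with a false condition and no else also gives
NULL). The helper names without_x and with_x are hypothetical and only
serve the illustration.

```r
# Ends with the if(): the function's value is whatever the if() yields.
without_x <- function(x) {
    if (any(is.na(x)))
        cat("Warning: missing values\n")
}

# Ends with x: the vector itself is the return value.
with_x <- function(x) {
    if (any(is.na(x)))
        cat("Warning: missing values\n")
    x
}

v <- c(1, NA, 3)
is.null(without_x(v))      # TRUE: only NULL comes back
identical(with_x(v), v)    # TRUE: the data come back
```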

The third question is whether it's possible to change the class of the date 
variables without using a for loop. I played around with this a little but 
didn't find a vectorized alternative. It may be that this is not really 
important. It's just that I've read in several places that for loops should be 
avoided wherever possible.
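
One possible loop-free alternative, as a sketch on a made-up data frame DF:
the loop seems to be needed only because sapply() simplifies its result to
a matrix, which drops the Date class. lapply() returns a list, so each
column keeps its class, and assigning into DF[] keeps the data frame
structure. The ISO-formatted strings are an assumption made to keep the
example short.

```r
# Small stand-in for the date columns.
DF <- data.frame(a = c("2009-05-23", "2010-02-30"),
                 b = c("2011-03-17", "2011-04-30"),
                 stringsAsFactors = FALSE)

# sapply() collapses the Date vectors into a numeric matrix.
m <- sapply(DF, function(col) as.Date(col, format = "%Y-%m-%d"))
class(m[, "a"])      # "numeric"

# lapply() keeps each element as a Date; DF[] <- keeps names and shape.
DF[] <- lapply(DF, function(col) as.Date(col, format = "%Y-%m-%d"))
sapply(DF, class)    # "Date" for both columns
```

In fun(), this would presumably amount to replacing the sapply() call and
the for loop with Dat[] <- lapply(names(Dat), function(j) f(j, Dat)).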

Thanks,

Paul 


##########################################
#### Code for detecting invalid dates ####
##########################################

#### Test Data ####

connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1840  02/30/2010 03/17/2011
3 06/17/1935  12/20/2008 07/un/2011
4 05/31/1937  01/18/2007 04/30/2011
5 06/31/1933  05/16/2015 11/20/un
")

TestDates <- data.frame(scan(connection, 
                 list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))

close(connection)

#### Input Data ####

TDSaved <- TestDates

#### List of Date Variables ####

DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

#### Date Function ####

fun <- function(Dat){
    f <- function(jj, DF){
        x <- as.character(DF[, jj])
        x <- unlist(strsplit(x, "/"))
        n <- length(x)
        M <- x[seq(1, n, 3)]
        D <- x[seq(2, n, 3)]
        Y <- x[seq(3, n, 3)]
        D[D == "un"] <- "15"
        Y <- ifelse(nchar(Y) > 4 | Y > 2012 | Y < 1900, NA, Y)
        x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
        if(any(is.na(x) & M != "un" & Y != "un"))
            cat("Warning: Invalid date values in", jj, "\n",
                as.character(DF[is.na(x), jj]), "\n")
        x
    }
    Dat <- data.frame(sapply(names(Dat), function(j) f(j, Dat)))
    for(i in names(Dat)) class(Dat[[i]]) <- "Date"
    Dat
}

#### Output Data ####

TD <- TDSaved

#### Read Dates ####

TD[, DateNames] <- fun(TD[, DateNames])
TD

