Hello Everyone,

I'm still new to R. I wrote some code that finds and prints invalid dates (see 
below). The code works, but I suspect it's not very good. If someone could show 
me a better way, I'd greatly appreciate it.

Here is some information about what I'm trying to accomplish. My sense is that 
the R date functions are best at identifying invalid dates when fed character 
data in their default format. So my code converts the input dates to character, 
breaks them apart using strsplit, and then reformats them. It then identifies 
which dates are "missing" in the sense that the month or year is unknown and 
prints out any remaining invalid date values. 
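
The parsing behaviour this relies on can be checked directly. Here is a minimal sketch, using a few dates from the test data already rewritten as year-month-day. The names `iso` and `parsed` are only illustrative and are not part of the code further down.

```r
# Dates already reformatted as year-month-day (values taken from the test data)
iso <- c("1940-06-20", "2010-02-30", "1933-06-31", "21931-11-23")

# as.Date() with an explicit format returns NA for strings it cannot parse
parsed <- as.Date(iso, format = "%Y-%m-%d")

# Show each input next to a flag saying whether it failed to parse
data.frame(input = iso, invalid = is.na(parsed))
```

Only the first value should come back as a usable date; the others are the kind of value the code is meant to report.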

As I see it, the code has at least 4 shortcomings.

1. It's too long. My understanding is that skilled programmers can often 
complete tasks like this in a few lines.

2. It's not vectorized. I started out trying to do something that was 
vectorized but ran into problems with the strsplit function. I looked at the 
help file and it appears this function will only accept a single character 
vector.

3. It prints out the incorrect dates but doesn't indicate which date variable 
they belong to. I tried various things with paste but never came up with 
anything that worked. Ideally, I'd like to get something that looks roughly 
like:

Error: Invalid date values in birthDT

"21931-11-23" 
"1933-06-31"

Error: Invalid date values in diagnosisDT

"2010-02-30"

4. There's no way to specify names for input and output data. I imagine it 
would be fairly easy to specify these as arguments to a function, but I am not 
sure how to incorporate that into a for loop.
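
To make points 2 to 4 concrete, here is a rough sketch of the shape I have in mind, not a tested solution. It assumes dates arrive as month/day/year strings with "un" for unknown parts, and that an unknown day can be set to the 15th. The function names `invalid_dates` and `report_invalid_dates` and the data frame `toy` are hypothetical names made up for illustration.

```r
# Hypothetical helper: return the reformatted strings in x that are not valid
# dates. x is a character (or factor) vector of dates written as month/day/year.
invalid_dates <- function(x) {
  # strsplit() takes the whole vector and returns one list element per date
  parts <- strsplit(as.character(x), "/", fixed = TRUE)
  month <- sapply(parts, `[`, 1)
  day   <- sapply(parts, `[`, 2)
  year  <- sapply(parts, `[`, 3)
  day[day == "un"] <- "15"                 # unknown day: assume mid-month
  iso <- paste(year, month, day, sep = "-")
  iso[month == "un" | year == "un"] <- NA  # unknown month or year: missing
  parsed <- as.Date(iso, format = "%Y-%m-%d")
  iso[is.na(parsed) & !is.na(iso)]         # not missing, but failed to parse
}

# Hypothetical wrapper: check the named columns of a data frame, print a
# labelled block for each column with invalid values, and return the results.
report_invalid_dates <- function(data, date_names) {
  bad <- lapply(data[date_names], invalid_dates)
  for (nm in date_names) {
    if (length(bad[[nm]]) > 0) {
      cat("Error: Invalid date values in ", nm, "\n\n", sep = "")
      cat(paste0('"', bad[[nm]], '"'), sep = "\n")
      cat("\n\n")
    }
  }
  invisible(bad)
}

# Small made-up data frame with the same layout as the test data
toy <- data.frame(
  birthDT     = c("11/23/21931", "06/20/1940", "06/31/1933"),
  diagnosisDT = c("05/23/2009", "02/30/2010", "12/20/2008"),
  stringsAsFactors = FALSE
)
res <- report_invalid_dates(toy, c("birthDT", "diagnosisDT"))
```

Passing the data frame and the column names as arguments would cover point 4, and looping over the names (rather than positions) is what lets the variable name appear in the message for point 3.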

Thanks,

Paul  

##########################################
#### Code for detecting invalid dates ####
##########################################

#### Test Data ####

connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1940  02/30/2010 03/17/2011
3 06/17/1935  12/20/2008 07/un/2011
4 05/31/1937  01/18/2007 04/30/2011
5 06/31/1933  05/16/2009 11/20/un
")

TestDates <- data.frame(scan(connection, 
                 list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))

close(connection)

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)

#### List of Date Variables ####

DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

#### Read Dates ####

for (i in seq_along(DateNames)) {
  # Convert the column to character (it may have been read as a factor)
  TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])

  # Split each date into month, day, and year
  TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]], "/")
  TestDates$Month <- sapply(TestDates$ParsedDT, function(x) x[1])
  TestDates$Day   <- sapply(TestDates$ParsedDT, function(x) x[2])
  TestDates$Year  <- sapply(TestDates$ParsedDT, function(x) x[3])

  # Unknown day: assume mid-month; unknown month or year: treat as missing
  TestDates$Day[TestDates$Day == "un"] <- "15"
  TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
  is.na(TestDates[DateNames][[i]][TestDates$Month == "un"]) <- TRUE
  is.na(TestDates[DateNames][[i]][TestDates$Year == "un"]) <- TRUE

  # A value is invalid if it is not missing but fails to parse as a date
  TestDates$Date <- as.Date(TestDates[DateNames][[i]], format = "%Y-%m-%d")
  TestDates$Invalid <- ifelse(is.na(TestDates$Date) &
                              !is.na(TestDates[DateNames][[i]]), 1, 0)

  # Keep the parsed dates if all are valid; otherwise print the invalid values
  if (sum(TestDates$Invalid) == 0) {
    TestDates[DateNames][[i]] <- TestDates$Date
  } else {
    print(TestDates[DateNames][[i]][TestDates$Invalid == 1])
  }

  # Drop the temporary working columns
  TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date,
                                             Invalid))
}

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)
