Re: [R] Checking for invalid dates: Code works but needs improvement

2012-02-04 Thread Paul Miller
Hi David and Rui,

Sorry to be so slow in replying. Thank you both for pointing out that the 
problem with my code was that I was using comparison operators on mixed data 
types. This is something I'll have to be more careful about in the future.

In an earlier email, David talked about how R can seem uncooperative or even 
unfair when you're just starting out. I too have had this experience, but it 
seems less unfair each time I use it. This time, I was able to write 
inelegant but functional code to solve my problem. Last time, I wasn't able to 
solve a much simpler problem at all. So I guess that's a kind of progress. 

At this point, I have serviceable code for checking my dates. I can improve 
this when I begin to develop some real skill as an R programmer, but it will do 
nicely for now.

Thanks everyone for your help with this.

Paul

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Checking for invalid dates: Code works but needs improvement

2012-02-01 Thread Paul Miller
Hi Marc,

That the code I wrote initially is over-engineered is certainly possible. Of 
course, Rui's solution is a reworking of that code. If starting from scratch, 
Rui likely would have done something quite different. I focused on Rui's code 
because it was complete and was a clear improvement over what I had initially.

Your code is great but doesn't do everything I think I need.

This leads to the question of exactly what I do need. I thought I had done well 
to create sample data and working (albeit inelegant) code. I realize now though 
that this really wasn't sufficient. It would have been better to supply a list 
of things I would like the code to do.

So let me try again. Ideally I'd like to have a function called something like 
readDates, a call to which might look like: 

readDates(indata=TestDates, outdata=TestDates2, 
  dates=c("birthDT", "diagnosisDT", "metastaticDT"), 
  datefmt="%m/%d/%Y", monthimp=15, 
  mindt="1900-01-01", maxdt=Sys.Date())

Besides allowing the user to specify an input data name, an output data name, 
the dates to be read, and an incoming date format, the readDates function would:

1. Impute by default the 15th of the month if the day is 'un', 'unk', 'Un', 
'Unk', 'UN', etc., but allow the user to select another value such as the 1st. 

2. Reject by default dates before 1900-01-01 or after the current date, but 
allow the user to specify other values.

3. Ignore dates with month or year values of 'un', 'unk', 'Un', 'Unk', 'UN', 
etc. That is, set them to missing but not report them as part of a warning 
message.

4. Reject dates with components (month, day, or year) that are not of the 
correct length. In most cases, I think this would involve lengths of 2, 2, and 
4. For some date formats though (e.g., 05Jan2012), this might not be the case.

5. Print warning messages for invalid dates something like: 

Warning: Invalid date values in birthDT 

11/23/21931 
06/20/1840 
06/31/1933 

Warning: Invalid date values in diagnosisDT 

02/30/2010 
05/16/2015 
 
6. Convert to a date any input columns that do not have invalid dates. This 
would include columns with unknown month and year values, like my 
metastaticDT. 

7. Allow things like the date format and minimum and maximum date values to 
vary by input column.

Admittedly, this is a lot. And I wouldn't blame you if you didn't want to touch 
it with a ten-foot pole. 

It's what's on my wish list though.

Thanks,

Paul
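
For illustration, here is a rough sketch of a function along these lines. It is
not code from any of the replies, and it rests on several assumptions: dates
are always month/day/year separated by "/", the result is returned rather than
written to an outdata argument, and the datefmt argument and the per-column
settings of point 7 are left out. The argument names follow the proposed call;
everything else is hypothetical.

```r
# Hypothetical sketch of a readDates-style helper (mm/dd/yyyy input only).
readDates <- function(indata, dates, monthimp = 15,
                      mindt = "1900-01-01", maxdt = Sys.Date()) {
  mindt <- as.Date(mindt)
  maxdt <- as.Date(maxdt)
  outdata <- indata
  # TRUE for "un", "unk", "Un", "UNK", ... (unknown component)
  unk <- function(v) !is.na(v) & grepl("^unk?$", v, ignore.case = TRUE)

  for (nm in dates) {
    raw   <- as.character(indata[[nm]])
    parts <- strsplit(raw, "/", fixed = TRUE)
    part  <- function(i) sapply(parts, function(p)
      if (length(p) == 3) p[i] else NA_character_)
    M <- part(1); D <- part(2); Y <- part(3)

    D[unk(D)] <- sprintf("%02d", monthimp)   # point 1: impute unknown day
    unknown <- unk(M) | unk(Y)               # point 3: NA, but not reported

    # point 4: components must be 2, 2 and 4 digits long
    shape_ok <- grepl("^[0-9]{2}$", M) & grepl("^[0-9]{2}$", D) &
      grepl("^[0-9]{4}$", Y)
    d  <- as.Date(paste(Y, M, D, sep = "-"), format = "%Y-%m-%d")
    ok <- shape_ok & !is.na(d) & d >= mindt & d <= maxdt   # point 2
    d[!ok] <- NA
    invalid <- !unknown & !ok

    if (any(invalid)) {                      # point 5: report, do not convert
      cat("Warning: Invalid date values in ", nm, "\n\n",
          paste(raw[invalid], collapse = "\n"), "\n\n", sep = "")
    } else {
      outdata[[nm]] <- d                     # point 6: convert clean columns
    }
  }
  outdata
}

TestDates <- data.frame(
  Patient      = 1:5,
  birthDT      = c("11/23/21931", "06/20/1840", "06/17/1935",
                   "05/31/1937", "06/31/1933"),
  diagnosisDT  = c("05/23/2009", "02/30/2010", "12/20/2008",
                   "01/18/2007", "05/16/2015"),
  metastaticDT = c("un/17/2011", "03/17/2011", "07/un/2011",
                   "04/30/2011", "11/20/un"),
  stringsAsFactors = FALSE)

# maxdt is fixed here so the example does not depend on the day it is run
TestDates2 <- readDates(TestDates, c("birthDT", "diagnosisDT", "metastaticDT"),
                        maxdt = "2012-02-01")
```

With the sample data, birthDT and diagnosisDT are reported and left as
character, while metastaticDT is converted to Date with NA for the two entries
whose month or year is unknown.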



[R] Checking for invalid dates: Code works but needs improvement

2012-01-30 Thread Paul Miller
Hi Rui, Marc, and Gabor,

Thanks for your replies to my question. All were helpful and it was interesting 
to see how different people approach various aspects of the same problem.

Spent some time this weekend looking at Rui's solution, which is certainly much 
clearer than my own. Managed to figure out pretty much all the details of how 
it works. Also managed to tweak it slightly in order to make it do exactly what 
I wanted. (See revised code below.)

Still have a few questions though. The first concerns the insertion of 
the code Y > 2012 to set year values beyond 2012 to NA (on line 10 of the 
function below).  When I add this (or use it in place of nchar(Y) > 4), the 
code successfully finds the problem date 05/16/2015. After that though, it 
produces the following error message:

Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning: Invalid date 
values in",  :  missing value where TRUE/FALSE needed

Why is this happening? If the code correctly handles the date 
06/20/1840 without producing an error, why can't it do likewise with 
05/16/2015?

The second question is why it's necessary to put x on line 15 following 
cat("Warning ..."). I know that I don't get any date columns if I don't 
include this but am not sure why.

The third question is whether it's possible to change the class of the date 
variables without using a for loop. I played around with this a little but 
didn't find a vectorized alternative. It may be that this is not really 
important. It's just that I've read in several places that for loops should be 
avoided wherever possible.

Thanks,

Paul 


##########################################
#### Code for detecting invalid dates ####
##########################################

#### Test Data ####

connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1840  02/30/2010 03/17/2011
3 06/17/1935  12/20/2008 07/un/2011
4 05/31/1937  01/18/2007 04/30/2011
5 06/31/1933  05/16/2015 11/20/un
")

TestDates <- data.frame(scan(connection, 
    list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))

close(connection)

#### Input Data ####

TDSaved <- TestDates

#### List of Date Variables ####

DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

#### Date Function ####

fun <- function(Dat){
    f <- function(jj, DF){
        x <- as.character(DF[, jj])
        x <- unlist(strsplit(x, "/"))
        n <- length(x)
        M <- x[seq(1, n, 3)]
        D <- x[seq(2, n, 3)]
        Y <- x[seq(3, n, 3)]
        D[D == "un"] <- "15"
        Y <- ifelse(nchar(Y) > 4 | Y > 2012 | Y < 1900, NA, Y)
        x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
        if(any(is.na(x) & M != "un" & Y != "un"))
            cat("Warning: Invalid date values in", jj, "\n",
                as.character(DF[is.na(x), jj]), "\n")
        x
    }
    Dat <- data.frame(sapply(names(Dat), function(j) f(j, Dat)))
    for(i in names(Dat)) class(Dat[[i]]) <- "Date"
    Dat
}

#### Output Data ####

TD <- TDSaved

#### Read Dates ####

TD[, DateNames] <- fun(TD[, DateNames])
TD



Re: [R] Checking for invalid dates: Code works but needs improvement

2012-01-30 Thread David Winsemius


On Jan 30, 2012, at 8:44 AM, Paul Miller wrote:


> Hi Rui, Marc, and Gabor,
>
> Thanks for your replies to my question. All were helpful and it was
> interesting to see how different people approach various aspects of
> the same problem.
>
> Spent some time this weekend looking at Rui's solution, which is
> certainly much clearer than my own. Managed to figure out pretty
> much all the details of how it works. Also managed to tweak it
> slightly in order to make it do exactly what I wanted. (See revised
> code below.)
>
> Still have a couple of questions though. The first concerns the
> insertion of the code Y > 2012 to set year values beyond 2012 to
> NA (on line 10 of the function below).  When I add this (or use it
> in place of nchar(Y) > 4), the code successfully finds the problem
> date 05/16/2015. After that though, it produces the following
> error message:
>
> Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning:
> Invalid date values in",  :  missing value where TRUE/FALSE needed


It's a bit dangerous to use comparison operators on mixed data types.
In your case you are comparing a character value to a numeric value
and may not realize that "2015" is not the same as 2015. Try "123" >
1000 if you want a quick counter-example. You may want to coerce the Y
value to numeric mode to be safe.


Also 'any' does not expect the logical connectives. You probably want:

any(is.na(x), M != "un", Y != "un")
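
To make the coercion concrete, here is a small sketch. The vector Y is made up
for illustration, with "un" marking an unknown year as in the test data. When a
character value is compared with a number, the number is converted to a string
and the strings are compared, so the result follows character order rather
than numeric size.

```r
# Illustrative values only
Y <- c("1840", "2015", "21931", "un")

# Character vs numeric: 2012 is coerced to "2012" and strings are compared
Y > 2012
Y < 1900

# The quick counter-example: TRUE, because "123" sorts after "1000"
"123" > 1000

# Coercing first gives numeric comparisons; "un" becomes NA
Ynum <- suppressWarnings(as.numeric(Y))
Ynum > 2012
Ynum < 1900
```

After coercion, the unknown year shows up as NA instead of being silently
ordered against the digits, so it has to be handled explicitly.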





David Winsemius, MD
Heritage Laboratories
West Hartford, CT



Re: [R] Checking for invalid dates: Code works but needs improvement

2012-01-30 Thread Marc Schwartz

On Jan 30, 2012, at 12:15 PM, David Winsemius wrote:

 
> It's a bit dangerous to use comparison operators on mixed data types. In your 
> case you are comparing a character value to a numeric value and may not 
> realize that "2015" is not the same as 2015. Try "123" > 1000 if you want a 
> quick counter-example. You may want to coerce the Y value to numeric mode 
> to be safe.
> 
> Also 'any' does not expect the logical connectives. You probably want:
> 
> any(is.na(x), M != "un", Y != "un")


Perhaps I am missing something relevant here, but I am still confused by what I 
see as an over-engineering of the code being implemented. As I understand them, 
the primary requirements are:

1. Impute the 15th of the month if the day is 'un'
2. Reject dates prior to 1900 or after 2011
3. Reject dates with an unknown ('un') month or year
4. Reject years with > 4 digits, also presuming that the value passed should 
always be 10 characters in length

If that is the basic functionality required, then a modest modification of my 
prior code should work:

checkDate <- function(x) {

  # Replace unknown day with 15
  tmp <- gsub("/un/", "/15/", x)

  tmp2 <- as.Date(tmp, format = "%m/%d/%Y")

  as.character(x[is.na(tmp2) | 
                 tmp2 < as.Date("1900/01/01") |
                 tmp2 > as.Date("2012/01/01") |
                 nchar(as.character(x)) > 10])
}


> TestDates
  Patient     birthDT diagnosisDT metastaticDT
1       1 11/23/21931  05/23/2009   un/17/2011
2       2  06/20/1840  02/30/2010   03/17/2011
3       3  06/17/1935  12/20/2008   07/un/2011
4       4  05/31/1937  01/18/2007   04/30/2011
5       5  06/31/1933  05/16/2015     11/20/un


> lapply(TestDates[, -1], checkDate)
$birthDT
[1] "11/23/21931" "06/20/1840"  "06/31/1933" 

$diagnosisDT
[1] "02/30/2010" "05/16/2015"

$metastaticDT
[1] "un/17/2011" "11/20/un"  


Does that not do what you require Paul?

Marc

 
 



Re: [R] Checking for invalid dates: Code works but needs improvement

2012-01-30 Thread Marc Schwartz



Ack...typo in my code for the upper end of the date range. Should be:

checkDate <- function(x) {

  # Replace unknown day with 15
  tmp <- gsub("/un/", "/15/", x)

  tmp2 <- as.Date(tmp, format = "%m/%d/%Y")

  as.character(x[is.na(tmp2) | 
                 tmp2 < as.Date("1900/01/01") |
                 tmp2 > as.Date("2011/12/31") |
                 nchar(as.character(x)) > 10])
}


Marc



Re: [R] Checking for invalid dates: Code works but needs improvement

2012-01-30 Thread Rui Barradas
Hello,
I'm glad it helped.


 
> Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning: Invalid
> date values in",  :
> missing value where TRUE/FALSE needed
>
> Why is this happening? If the code correctly handles the date
> 06/20/1840 without producing an error,
> why can't it do likewise with 05/16/2015?
 

Because "un" is greater than 2012 when the two are compared as character
strings. The problem starts with the 'ifelse', where we are changing the Y
values to NA's if the condition is met. Once an "un" year has become NA,
Y != "un" is NA as well, so 'any' can return NA, which 'if' cannot handle.
Correction: instead of ifelse use

Known <- M != "un" & Y != "un"
Y[nchar(Y) > 4 | Y > 2012 | Y < 1900] <- NA

Now you need to place the conjunction of Known and is.na(x) in both the 'if'
and 'cat' statements.
(Only dates with known year and month but wrongly keyed in will be printed.)


if(any(Known & is.na(x)))
    cat("Warning: Invalid date values in", jj, "\n",
        as.character(DF[Known & is.na(x), jj]), "\n") 

Like David said, 'any' expects comma separated logical vectors but in this
case the conjunction & agrees more with what is wanted.
It's an adversative conjunction, that reads Known BUT is.na(x).
The point is that according to your code the unknown shouldn't be printed.
They are just NA, not invalid.

Does this also answer the second question?
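
Putting the pieces together, a self-contained sketch of the corrected check
might look like the following. The name check_dates and the numeric coercion
of the year (David's suggestion) are additions for illustration, not the exact
code posted in the thread, and the sketch assumes every entry has three parts
separated by "/". It also bears on the second question: a function returns the
value of the last expression it evaluates, and the value of the 'if' statement
is NULL (whether or not 'cat' runs), so without the trailing x no dates come
back.

```r
# Sketch only: hypothetical helper, year limits hard-coded as in the thread
check_dates <- function(v, label = "dates") {
  parts <- strsplit(as.character(v), "/", fixed = TRUE)
  M <- sapply(parts, `[`, 1)
  D <- sapply(parts, `[`, 2)
  Y <- sapply(parts, `[`, 3)

  # Record which entries have a known month and year BEFORE changing Y
  Known <- M != "un" & Y != "un"

  D[D == "un"] <- "15"                       # impute unknown day
  Ynum <- suppressWarnings(as.numeric(Y))    # "un" -> NA; compare as numbers
  Y[nchar(Y) > 4 | is.na(Ynum) | Ynum > 2012 | Ynum < 1900] <- NA

  x <- as.Date(paste(Y, M, D, sep = "-"), format = "%Y-%m-%d")

  invalid <- Known & is.na(x)
  if (any(invalid))
    cat("Warning: Invalid date values in", label, "\n",
        as.character(v[invalid]), "\n")

  x   # last expression evaluated, hence the return value
}

birthDT      <- c("11/23/21931", "06/20/1840", "06/17/1935",
                  "05/31/1937", "06/31/1933")
metastaticDT <- c("un/17/2011", "03/17/2011", "07/un/2011",
                  "04/30/2011", "11/20/un")

b <- check_dates(birthDT, "birthDT")            # prints a warning, 3 values
m <- check_dates(metastaticDT, "metastaticDT")  # prints nothing
```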

As for the third question, there are so few columns that I don't believe the
for loop can hurt.
There could be a problem in changing the class to Date using *apply because
R passes arguments to functions by value and only the copy inside the
function would be changed.
Considering the number of iterations in the loop, it's maybe simpler like
this.
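
One possible way to see why the loop was needed, and to avoid it: sapply()
simplifies its result to a matrix, and a matrix does not carry the Date class,
so the columns come back as plain numbers. Keeping the result as a list with
lapply() and assigning into Dat[] preserves the class. The small data frame
below is made up for illustration.

```r
Dat <- data.frame(a = c("2011-03-17", "2011-04-30"),
                  b = c("2008-12-20", "2007-01-18"),
                  stringsAsFactors = FALSE)

# sapply() simplifies to a numeric matrix: the Date class is dropped
m <- sapply(Dat, as.Date)
class(m[, "a"])

# lapply() returns a list of Date vectors; Dat[] <- keeps the data frame shape
Dat[] <- lapply(Dat, as.Date)
sapply(Dat, class)
```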

Rui Barradas





[R] Checking for invalid dates: Code works but needs improvement

2012-01-26 Thread Paul Miller
Sorry, sent this earlier but forgot to add an informative subject line. Am 
resending, in the hopes of getting further replies. My apologies. Hope this is 
OK.

Paul


Hi Rui,

Thanks for your reply to my post. My code still has various shortcomings but at 
least now it is fully functional.

It may be that, as I transition to using R, I'll have to live with some less 
than ideal code, at least at the outset. I'll just have to write and re-write 
my code as I improve.

Appreciate your help.

Paul


Message: 66
Date: Tue, 24 Jan 2012 09:54:57 -0800 (PST)
From: Rui Barradas ruipbarra...@sapo.pt
To: r-help@r-project.org
Subject: Re: [R] Checking for invalid dates: Code works but needs
improvement
Message-ID: 1327427697928-4324533.p...@n4.nabble.com
Content-Type: text/plain; charset=us-ascii

Hello,

Point 3 is very simple: instead of 'print' use 'cat'.
Unlike 'print', it allows for several arguments and (very) simple formatting.

  { cat("Error: Invalid date values in", DateNames[[i]], "\n",
   TestDates[DateNames][[i]][TestDates$Invalid==1], "\n") }

Rui Barradas
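
A small sketch of the difference, with made-up values standing in for the
column name and the invalid entries:

```r
col_name <- "birthDT"                      # illustrative values
bad <- c("21931-11-23", "1933-06-31")

# print() shows one object, with quotes and an index prefix
print(bad)

# cat() concatenates several arguments and writes them as plain text
cat("Error: Invalid date values in", col_name, "\n", bad, "\n")
```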

Message: 53
Date: Tue, 24 Jan 2012 08:54:49 -0800 (PST)
From: Paul Miller pjmiller...@yahoo.com
To: r-help@r-project.org
Subject: [R] Checking for invalid dates: Code works but needs
improvement
Message-ID:
1327424089.1149.yahoomailclas...@web161604.mail.bf1.yahoo.com
Content-Type: text/plain; charset=us-ascii

Hello Everyone,

Still new to R. Wrote some code that finds and prints invalid dates (see 
below). This code works but I suspect it's not very good. If someone could show 
me a better way, I'd greatly appreciate it.

Here is some information about what I'm trying to accomplish. My sense is that 
the R date functions are best at identifying invalid dates when fed character 
data in their default format. So my code converts the input dates to character, 
breaks them apart using strsplit, and then reformats them. It then identifies 
which dates are missing in the sense that the month or year are unknown and 
prints out any remaining invalid date values. 

As I see it, the code has at least 4 shortcomings.

1. It's too long. My understanding is that skilled programmers can usually or 
often complete tasks like this in a few lines.

2. It's not vectorized. I started out trying to do something that was 
vectorized but ran into problems with the strsplit function. I looked at the 
help file and it appears this function will only accept a single character 
vector.

3. It prints out the incorrect dates but doesn't indicate which date variable 
they belong to. I tried various things with paste but never came up with 
anything that worked. Ideally, I'd like to get something that looks roughly 
like:

Error: Invalid date values in birthDT

21931-11-23 
1933-06-31

Error: Invalid date values in diagnosisDT

2010-02-30

4. There's no way to specify names for input and output data. I imagine it 
would be fairly easy to specify these in the arguments to a function but am not 
sure how to incorporate them into a for loop.

Thanks,

Paul  
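
On the second point, strsplit() does accept a whole character vector and
returns a list with one element per input string. A minimal sketch (variable
names are illustrative, and each date is assumed to split into exactly three
pieces) that turns the pieces into month, day and year columns without a loop:

```r
dates <- c("11/23/21931", "07/un/2011", "11/20/un")

parts <- strsplit(dates, "/", fixed = TRUE)   # list with one element per date
mat <- do.call(rbind, parts)                  # 3 x 3 character matrix
colnames(mat) <- c("Month", "Day", "Year")
mat
```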

##########################################
#### Code for detecting invalid dates ####
##########################################

#### Test Data ####

connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1940  02/30/2010 03/17/2011
3 06/17/1935  12/20/2008 07/un/2011
4 05/31/1937  01/18/2007 04/30/2011
5 06/31/1933  05/16/2009 11/20/un
")

TestDates <- data.frame(scan(connection, 
    list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))

close(connection)

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)

#### List of Date Variables ####

DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

#### Read Dates ####

for (i in seq(TestDates[DateNames])){
    TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
    TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]], "/")
    TestDates$Month <- sapply(TestDates$ParsedDT, function(x) x[1])
    TestDates$Day <- sapply(TestDates$ParsedDT, function(x) x[2])
    TestDates$Year <- sapply(TestDates$ParsedDT, function(x) x[3])
    TestDates$Day[TestDates$Day=="un"] <- "15"
    TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
    is.na( TestDates[DateNames][[i]] [TestDates$Month=="un"] ) <- TRUE
    is.na( TestDates[DateNames][[i]] [TestDates$Year=="un"] ) <- TRUE
    TestDates$Date <- as.Date(TestDates[DateNames][[i]], format="%Y-%m-%d")
    TestDates$Invalid <- ifelse(is.na(TestDates$Date) & 
        !is.na(TestDates[DateNames][[i]]), 1, 0)
    if( sum(TestDates$Invalid)==0 ) 
        { TestDates[DateNames][[i]] <- TestDates$Date } else
        { print ( TestDates[DateNames][[i]][TestDates$Invalid==1]) }
    TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date, 
        Invalid))
}

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)


Re: [R] Checking for invalid dates: Code works but needs improvement

2012-01-26 Thread Marc Schwartz
Paul,

I have a partial solution for you. It is partial in that I have not quite 
figured out the correct incantation to convert a 5 digit year (eg. 11/23/21931) 
properly using the R date functions. According to various sources (eg. man 
strptime and man strftime) as well as the R help for both functions, there are 
extended formats available, but I am having a bout of cerebral flatulence in 
getting them to work correctly and a search has not been fruitful. Perhaps 
someone else can offer some insights.

That being said, with the exception of correctly handling that one situation, 
which arguably IS a valid date a long time in the future and which would 
otherwise result in a truncated year (first four digits only)

> as.Date("11/23/21931", format = "%m/%d/%Y")
[1] "2193-11-23"

Here is one approach:

# Check the date. If as.Date() fails or the input is > 10 characters return it
checkDate <- function(x) {
  as.character(x[is.na(as.Date(x, format = "%m/%d/%Y")) | 
                 nchar(as.character(x)) > 10])
}

> lapply(TestDates[, -1], checkDate)
$birthDT
[1] "11/23/21931" "06/31/1933" 

$diagnosisDT
[1] "02/30/2010"

$metastaticDT
[1] "un/17/2011" "07/un/2011" "11/20/un"  


You could fine tune the checkDate() function to handle other formats, etc.
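
One possible refinement, sketched under the assumption that every valid input
has exactly the shape mm/dd/yyyy: check the layout with a regular expression
before relying on as.Date(), so that a five-digit year is rejected rather than
truncated. The name checkDate2 and its pattern argument are illustrative and
not part of the code above.

```r
# Hypothetical variant: return the inputs that are malformed or unparseable
checkDate2 <- function(x, format = "%m/%d/%Y",
                       pattern = "^[0-9]{2}/[0-9]{2}/[0-9]{4}$") {
  x <- as.character(x)
  well_formed <- grepl(pattern, x)          # exactly 2, 2 and 4 digits
  parsed <- as.Date(x, format = format)     # NA for impossible dates
  x[!well_formed | is.na(parsed)]
}

res <- checkDate2(c("11/23/21931", "06/17/1935", "06/31/1933", "07/un/2011"))
res
```

For a different input layout, both the format and the pattern would need to be
changed together.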

HTH,

Marc Schwartz



Re: [R] Checking for invalid dates: Code works but needs improvement

2012-01-26 Thread Gabor Grothendieck

If s is a vector of character strings representing dates then bad is a
logical vector which is TRUE for the bad ones and FALSE for the good
ones (adjust as needed if a different date range is valid) so s[bad]
is the bad inputs and the output d is a Date vector with NAs for the
bad ones:

x <- gsub("un", 15, s)
d <- as.Date(x, "%m/%d/%Y")
bad <- is.na(d) | d < as.Date("1900-01-01") | d > Sys.Date()
d[bad] <- NA
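
As a rough sketch, the same four lines could be wrapped in a function and
applied to each date column in turn, printing the offending raw values
together with the column name. The name check_dates and its arguments are
hypothetical (they do not come from this thread), and the data frame simply
retypes the test data from the original post.

```r
# Hypothetical helper: check_dates() is an illustrative name, not an existing function.
check_dates <- function(df, cols,
                        lower = as.Date("1900-01-01"), upper = Sys.Date()) {
  for (nm in cols) {
    s <- as.character(df[[nm]])
    x <- gsub("un", "15", s)                 # same substitution as above
    d <- as.Date(x, "%m/%d/%Y")
    bad <- is.na(d) | d < lower | d > upper  # unparseable or out of range
    d[bad] <- NA
    if (any(bad))
      cat("Invalid date values in", nm, "\n", s[bad], "\n")
    df[[nm]] <- d                            # column becomes class Date
  }
  df
}

# The test data from the original post, typed in directly
TestDates <- data.frame(
  Patient      = 1:5,
  birthDT      = c("11/23/21931", "06/20/1940", "06/17/1935",
                   "05/31/1937", "06/31/1933"),
  diagnosisDT  = c("05/23/2009", "02/30/2010", "12/20/2008",
                   "01/18/2007", "05/16/2009"),
  metastaticDT = c("un/17/2011", "03/17/2011", "07/un/2011",
                   "04/30/2011", "11/20/un"),
  stringsAsFactors = FALSE
)

checked <- check_dates(TestDates, c("birthDT", "diagnosisDT", "metastaticDT"))
```

Note that with this approach a date whose month or year is "un" ends up
flagged along with the truly invalid values, since the substituted 15 does
not give a usable month or year.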

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Checking for invalid dates: Code works but needs improvement

2012-01-26 Thread Rui Barradas
Hello, again.

I now have a more complete answer to your points.

 1. It's too long. My understanding is that skilled programmers can usually
 or often complete tasks like this in a few lines.

It's not much shorter, but it's more readable. (The programmer is always
suspect.)

 2. It's not vectorized. I started out trying to do something that was
 vectorized but ran into problems with the strsplit function. I looked at
 the help file and it appears this function will only accept a single
 character vector.

All but one of the instructions are vectorized, and the one that is not only
loops over a few column names.
Use 'unlist' on the output of 'strsplit' to get a vector.
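
For what it's worth, strsplit is itself vectorized: given a character vector
it returns a list with one element per input string, and unlist flattens that
list. A small sketch using three of the birth dates from the test data (the
variable names parts, flat and m are only for illustration):

```r
x <- c("11/23/21931", "06/20/1940", "06/17/1935")

parts <- strsplit(x, "/", fixed = TRUE)  # list of length 3, one vector per date
flat  <- unlist(parts)                   # month, day, year, month, day, year, ...

# Pick out every third element, assuming each string has exactly three parts
n <- length(flat)
M <- flat[seq(1, n, 3)]
D <- flat[seq(2, n, 3)]
Y <- flat[seq(3, n, 3)]

# An alternative that keeps the pieces in a matrix, one row per date
m <- do.call(rbind, parts)
```

The stride-of-three indexing only lines up if every string splits into
exactly three pieces; a value with a missing or extra "/" would shift
everything after it.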

 4. There's no way to specify names for input and output data. I imagine
 it would be fairly easy to specify this in the arguments to a function but
 am not sure how to incorporate it into a for loop.

You can now specify any matrix or data.frame, but it will only process the
columns with dates. (Strictly speaking this is not true: it will process
anything with a '/' in it, so pay attention.)
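
One possible guard against that, sketched under the assumption that every
date is written as three '/'-separated parts that are either digits or "un".
The helper looks_like_date and the small data frame are made up for the
example and are not part of the code in this message:

```r
# Hypothetical helper: TRUE if every value looks like part/part/part,
# where each part is digits or the placeholder "un".
looks_like_date <- function(x) {
  all(grepl("^([0-9]+|un)/([0-9]+|un)/([0-9]+|un)$", as.character(x)))
}

df <- data.frame(
  Patient = 1:2,
  birthDT = c("11/23/21931", "06/20/1940"),
  ratio   = c("1/2", "3/4"),   # contains "/", but is not a date
  stringsAsFactors = FALSE
)

# Names of the columns whose values all match the pattern
date_cols <- names(df)[vapply(df, looks_like_date, logical(1))]
date_cols
```

Only the columns named in date_cols would then be passed on for date
checking.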

Near the beginning of your code include the following:


 TestDates <- data.frame(scan(connection,
     list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))

 close(connection)

TDSaved <- TestDates  # to avoid reopening the connection

And then, after all of it,

fun <- function(Dat){
    f <- function(jj, DF){
        x <- as.character(DF[, jj])
        x <- unlist(strsplit(x, "/"))
        n <- length(x)
        M <- x[seq(1, n, 3)]
        D <- x[seq(2, n, 3)]
        Y <- x[seq(3, n, 3)]
        D[D == "un"] <- "15"
        # Y is character, so 1900 is coerced and the comparison is done on strings
        Y <- ifelse(nchar(Y) > 4 | Y < 1900, NA, Y)
        x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
        if(any(is.na(x)))
            cat("Warning: Invalid date values in", jj, "\n",
                as.character(DF[is.na(x), jj]), "\n")
        x
    }
    colinx <- colnames(as.data.frame(Dat))
    Dat <- data.frame(sapply(colinx, function(j) f(j, Dat)))
    for(i in colinx) class(Dat[[i]]) <- "Date"
    Dat
}

TD <- TDSaved

TD[, DateNames] <- fun(TD[, DateNames])

TD

Had fun writing it.
Good luck.

Rui Barradas






[R] Checking for invalid dates: Code works but needs improvement

2012-01-24 Thread Paul Miller
Hello Everyone,

Still new to R. Wrote some code that finds and prints invalid dates (see 
below). This code works but I suspect it's not very good. If someone could show 
me a better way, I'd greatly appreciate it.

Here is some information about what I'm trying to accomplish. My sense is that 
the R date functions are best at identifying invalid dates when fed character 
data in their default format. So my code converts the input dates to character, 
breaks them apart using strsplit, and then reformats them. It then identifies 
which dates are missing in the sense that the month or year are unknown and 
prints out any remaining invalid date values. 
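
A quick illustration of the behaviour the code relies on, as far as I
understand it: as.Date returns NA when a string cannot be read as a real
calendar date in the given format. The three values are made-up examples in
the reformatted year-month-day layout.

```r
# June has 30 days and February 2010 has 28, so the last two are not real dates
d <- as.Date(c("1940-06-20", "1933-06-31", "2010-02-30"), format = "%Y-%m-%d")

d         # the two impossible dates come back as NA
is.na(d)  # logical vector that can be used to pick out the bad inputs
```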

As I see it, the code has at least 4 shortcomings.

1. It's too long. My understanding is that skilled programmers can usually or 
often complete tasks like this in a few lines.

2. It's not vectorized. I started out trying to do something that was 
vectorized but ran into problems with the strsplit function. I looked at the 
help file and it appears this function will only accept a single character 
vector.

3. It prints out the incorrect dates but doesn't indicate which date variable 
they belong to. I tried various things with paste but never came up with 
anything that worked. Ideally, I'd like to get something that looks roughly 
like:

Error: Invalid date values in birthDT

21931-11-23 
1933-06-31

Error: Invalid date values in diagnosisDT

2010-02-30

4. There's no way to specify names for input and output data. I imagine it 
would be fairly easy to specify this in the arguments to a function but am not 
sure how to incorporate it into a for loop.

Thanks,

Paul  

##########################################
#### Code for detecting invalid dates ####
##########################################

#### Test Data ####

connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1940  02/30/2010 03/17/2011
3 06/17/1935  12/20/2008 07/un/2011
4 05/31/1937  01/18/2007 04/30/2011
5 06/31/1933  05/16/2009 11/20/un
")

TestDates <- data.frame(scan(connection,
    list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))

close(connection)

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)

#### List of Date Variables ####

DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

#### Read Dates ####

for (i in seq(TestDates[DateNames])){
    TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
    TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]], "/")
    TestDates$Month <- sapply(TestDates$ParsedDT, function(x) x[1])
    TestDates$Day <- sapply(TestDates$ParsedDT, function(x) x[2])
    TestDates$Year <- sapply(TestDates$ParsedDT, function(x) x[3])
    TestDates$Day[TestDates$Day=="un"] <- "15"
    TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
    is.na( TestDates[DateNames][[i]] [TestDates$Month=="un"] ) <- T
    is.na( TestDates[DateNames][[i]] [TestDates$Year=="un"] ) <- T
    TestDates$Date <- as.Date(TestDates[DateNames][[i]], format="%Y-%m-%d")
    TestDates$Invalid <- ifelse(is.na(TestDates$Date) &
        !is.na(TestDates[DateNames][[i]]), 1, 0)
    if( sum(TestDates$Invalid)==0 )
        { TestDates[DateNames][[i]] <- TestDates$Date } else
        { print ( TestDates[DateNames][[i]][TestDates$Invalid==1]) }
    TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date,
        Invalid))
}

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)



Re: [R] Checking for invalid dates: Code works but needs improvement

2012-01-24 Thread Rui Barradas
Hello,

Point 3 is very simple, instead of 'print' use 'cat'.
Unlike 'print' it allows for several arguments and (very) simple formatting.

  { cat("Error: Invalid date values in", DateNames[[i]], "\n",
        TestDates[DateNames][[i]][TestDates$Invalid==1], "\n") }
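
A small self-contained illustration of the difference, using made-up values
(var_name, bad_vals and msg are names chosen for this example) rather than
the TestDates data:

```r
var_name <- "birthDT"                       # illustrative values only
bad_vals <- c("21931-11-23", "1933-06-31")

print(bad_vals)   # one object, shown with an index and quotes

# cat takes several arguments and writes them out separated by spaces
cat("Error: Invalid date values in", var_name, "\n",
    bad_vals, "\n")

# For one value per line, the message can be built with paste first
msg <- paste0("Error: Invalid date values in ", var_name, "\n",
              paste(bad_vals, collapse = "\n"), "\n")
cat(msg)
```

Building the string with paste0 first also makes it easy to reuse the same
text elsewhere, for example in a call to warning().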

Rui Barradas



