Re: [R] Checking for invalid dates: Code works but needs improvement
Hi David and Rui, Sorry to be so slow in replying. Thank you both for pointing out that the problem with my code was that I was using comparison operators on mixed data types. This is something I'll have to be more careful about in the future. In an earlier email, David talked about how R can seem uncooperative or even unfair when you're just starting out. I too have had this experience, but it seems less unfair each time I use it. This time, I was able to write inelegant but functional code to solve my problem. Last time, I wasn't able to solve a much simpler problem at all. So I guess that's a kind of progress. At this point, I have serviceable code for checking my dates. I can improve this when I begin to develop some real skill as an R programmer, but it will do nicely for now. Thanks everyone for your help with this. Paul __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Checking for invalid dates: Code works but needs improvement
Hi Marc,

That the code I wrote initially is over-engineered is certainly possible. Of course, Rui's solution is a reworking of that code. If starting from scratch, Rui likely would have done something quite different. I focused on Rui's code because it was complete and was a clear improvement over what I had initially.

Your code is great but doesn't do everything I think I need. This leads to the question of exactly what I do need. I thought I had done well to create sample data and working (albeit inelegant) code. I realize now though that this really wasn't sufficient. It would have been better to supply a list of things I would like the code to do. So let me try again.

Ideally I'd like to have a function called something like readDates, a call to which might look like:

readDates(indata=TestDates, outdata=TestDates2, dates=c("birthDT", "diagnosisDT", "metastaticDT"), datefmt="%m/%d/%Y", monthimp=15, mindt="1900-01-01", maxdt=Sys.Date())

Besides allowing the user to specify an input data name, an output data name, the dates to be read, and an incoming date format, the readDates function would:

1. Impute by default the 15th of the month if the day is 'un', 'unk', 'Un', 'Unk', 'UN', etc., but allow the user to select another value such as the 1st.

2. Reject by default dates before 1900-01-01 or after the current date, but allow the user to specify other values.

3. Ignore dates with month or year values of 'un', 'unk', 'Un', 'Unk', 'UN', etc. That is, set them to missing but not report them as part of a warning message.

4. Reject dates with components (month, day, or year) that are not of the correct length. In most cases, I think this would involve lengths of 2, 2, and 4. For some date formats though (e.g., 05Jan2012), this might not be the case.

5. Print warning messages for invalid dates, something like:

Warning: Invalid date values in birthDT
11/23/21931 06/20/1840 06/31/1933
Warning: Invalid date values in diagnosisDT
02/30/2010 05/16/2015

6. Convert to a date any input columns that do not have invalid dates. This would include columns with unknown month and year values, like my metastaticDT.

7. Allow things like the date format and minimum and maximum date values to vary by input column.

Admittedly, this is a lot. And I wouldn't blame you if you didn't want to touch it with a ten-foot pole. It's what's on my wish list though.

Thanks,

Paul
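One possible starting point for items 1 to 6 of the wish list is sketched below. It is only a sketch under stated assumptions: readDates and its argument names come from the wish list and are not an existing R function, the input is assumed to be month/day/year separated by "/", the data frame is returned rather than named through an outdata argument, and per-column settings (item 7) are not attempted. The upper limit is fixed at 2012-01-30 in the usage so the result does not depend on the day it is run.

```r
# Hypothetical readDates(): a sketch of wish-list items 1-6, not a finished tool.
# Assumes m/d/Y input with "/" separators; returns the modified data frame.
readDates <- function(indata, dates, monthimp = 15,
                      mindt = as.Date("1900-01-01"), maxdt = Sys.Date()) {
  # TRUE for "un", "unk", "Un", "UNK", ... (any case)
  unk <- function(v) grepl("^unk?$", v, ignore.case = TRUE)
  outdata <- indata
  for (nm in dates) {
    raw   <- as.character(indata[[nm]])
    parts <- strsplit(raw, "/", fixed = TRUE)
    M <- sapply(parts, `[`, 1)
    D <- sapply(parts, `[`, 2)
    Y <- sapply(parts, `[`, 3)
    D[unk(D)] <- sprintf("%02d", monthimp)       # item 1: impute unknown day
    unknown  <- unk(M) | unk(Y)                  # item 3: missing, not invalid
    shape_ok <- nchar(M) == 2 & nchar(D) == 2 & nchar(Y) == 4   # item 4
    d <- as.Date(paste(Y, M, D, sep = "-"), format = "%Y-%m-%d")
    # item 2: range check; is.na(d) catches impossible dates such as 02/30
    invalid <- !unknown & (!shape_ok | is.na(d) | d < mindt | d > maxdt)
    d[unknown | invalid] <- NA
    if (any(invalid)) {
      # item 5: report the offending raw values, leave the column unconverted
      warning("Invalid date values in ", nm, ": ",
              paste(raw[invalid], collapse = " "), call. = FALSE)
    } else {
      outdata[[nm]] <- d                         # item 6: convert clean columns
    }
  }
  outdata
}

TestDates <- data.frame(
  Patient      = 1:5,
  birthDT      = c("11/23/21931", "06/20/1840", "06/17/1935", "05/31/1937", "06/31/1933"),
  diagnosisDT  = c("05/23/2009", "02/30/2010", "12/20/2008", "01/18/2007", "05/16/2015"),
  metastaticDT = c("un/17/2011", "03/17/2011", "07/un/2011", "04/30/2011", "11/20/un"),
  stringsAsFactors = FALSE
)

TestDates2 <- readDates(TestDates, c("birthDT", "diagnosisDT", "metastaticDT"),
                        maxdt = as.Date("2012-01-30"))
str(TestDates2)
```

Returning the data frame instead of taking an output name keeps the function free of side effects, which is the more usual R style; the caller chooses the output name by assignment.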
[R] Checking for invalid dates: Code works but needs improvement
Hi Rui, Marc, and Gabor,

Thanks for your replies to my question. All were helpful and it was interesting to see how different people approach various aspects of the same problem.

Spent some time this weekend looking at Rui's solution, which is certainly much clearer than my own. Managed to figure out pretty much all the details of how it works. Also managed to tweak it slightly in order to make it do exactly what I wanted. (See revised code below.)

Still have a couple of questions though. The first concerns the insertion of the code "Y > 2012" to set year values beyond 2012 to NA (on line 10 of the function below). When I add this (or use it in place of "nchar(Y) > 4"), the code successfully finds the problem date "05/16/2015". After that though, it produces the following error message:

Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning: Invalid date values in", :
  missing value where TRUE/FALSE needed

Why is this happening? If the code correctly handles the date "06/20/1840" without producing an error, why can't it do likewise with "05/16/2015"?

The second question is why it's necessary to put "x" on line 15 following "cat("Warning ...)". I know that I don't get any date columns if I don't include this but am not sure why.

The third question is whether it's possible to change the class of the date variables without using a for loop. I played around with this a little but didn't find a vectorized alternative. It may be that this is not really important. It's just that I've read in several places that for loops should be avoided wherever possible.

Thanks,

Paul

#### Code for detecting invalid dates ####

#### Test Data ####

connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1840 02/30/2010 03/17/2011
3 06/17/1935 12/20/2008 07/un/2011
4 05/31/1937 01/18/2007 04/30/2011
5 06/31/1933 05/16/2015 11/20/un
")

TestDates <- data.frame(scan(connection, list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
close(connection)

#### Input Data ####

TDSaved <- TestDates

#### List of Date Variables ####

DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

#### Date Function ####

fun <- function(Dat){
    f <- function(jj, DF){
        x <- as.character(DF[, jj])
        x <- unlist(strsplit(x, "/"))
        n <- length(x)
        M <- x[seq(1, n, 3)]
        D <- x[seq(2, n, 3)]
        Y <- x[seq(3, n, 3)]
        D[D == "un"] <- "15"
        Y <- ifelse(nchar(Y) > 4 | Y > 2012 | Y < 1900, NA, Y)
        x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
        if(any(is.na(x) & M != "un" & Y != "un"))
            cat("Warning: Invalid date values in", jj, "\n",
                as.character(DF[is.na(x), jj]), "\n")
        x
    }
    Dat <- data.frame(sapply(names(Dat), function(j) f(j, Dat)))
    for(i in names(Dat)) class(Dat[[i]]) <- "Date"
    Dat
}

#### Output Data ####

TD <- TDSaved

#### Read Dates ####

TD[, DateNames] <- fun(TD[, DateNames])
TD
Re: [R] Checking for invalid dates: Code works but needs improvement
On Jan 30, 2012, at 8:44 AM, Paul Miller wrote:

> Still have a couple of questions though. The first concerns the insertion of the code "Y > 2012" to set year values beyond 2012 to NA (on line 10 of the function below). When I add this (or use it in place of "nchar(Y) > 4"), the code successfully finds the problem date "05/16/2015". After that though, it produces the following error message:
>
> Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning: Invalid date values in", :
>   missing value where TRUE/FALSE needed

It's a bit dangerous to use comparison operators on mixed data types. In your case you are comparing a character value to a numeric value and may not realize that "2015" is not the same as 2015. Try "123" < 1000 if you want a quick counter-example. You may want to coerce the Y value to numeric mode to be safe.

Also 'any' does not expect the logical connectives. You probably want:

any(is.na(x), M != "un", Y != "un")

> Why is this happening? If the code correctly handles the date "06/20/1840" without producing an error, why can't it do likewise with "05/16/2015"?
>
> <snip>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
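David's point about mixed types can be seen in a few lines. This is a minimal sketch with made-up year strings rather than the thread's data frame; the string comparisons assume the usual collation in which digits sort before letters.

```r
# When a character vector is compared with a number, the number is coerced
# to a string, so the comparison is alphabetical rather than numerical.
Y <- c("1840", "2015", "21931", "un")

mixed <- Y > 2012                          # really compares against "2012"
Ynum  <- suppressWarnings(as.numeric(Y))   # "un" cannot be converted: NA
numeric_cmp <- Ynum > 2012                 # numerical comparison; NA stays NA

print(data.frame(Y, mixed, numeric_cmp))

# The counter-example from the reply: as strings, "123" sorts after "1000"
print("123" < 1000)
```

Note that "un" compares as greater than "2012" in the mixed case, which is what later turns the unknown year into NA.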
Re: [R] Checking for invalid dates: Code works but needs improvement
On Jan 30, 2012, at 12:15 PM, David Winsemius wrote:

> <snip>
>
> It's a bit dangerous to use comparison operators on mixed data types. In your case you are comparing a character value to a numeric value and may not realize that "2015" is not the same as 2015. Try "123" < 1000 if you want a quick counter-example. You may want to coerce the Y value to numeric mode to be safe.
>
> Also 'any' does not expect the logical connectives. You probably want:
>
> any(is.na(x), M != "un", Y != "un")

Perhaps I am missing something relevant here, but I am still confused by what I see as an over-engineering of the code being implemented.

If the primary requirements are:

1. Impute the 15th of the month if the day is 'un'

2. Reject dates prior to 1900 or after 2011

3. Reject dates with an unknown ('un') month or year

4. Reject years with > 4 digits, also presuming that the value passed should always be 10 characters in length

If that is the basic functionality required, then a modest modification of my prior code should work:

checkDate <- function(x) {
  # Replace unknown day with 15
  tmp <- gsub("/un/", "/15/", x)

  tmp2 <- as.Date(tmp, format = "%m/%d/%Y")

  as.character(x[is.na(tmp2) |
                 tmp2 < as.Date("1900/01/01") |
                 tmp2 > as.Date("2012/01/01") |
                 nchar(as.character(x)) > 10])
}

> TestDates
  Patient     birthDT diagnosisDT metastaticDT
1       1 11/23/21931  05/23/2009   un/17/2011
2       2  06/20/1840  02/30/2010   03/17/2011
3       3  06/17/1935  12/20/2008   07/un/2011
4       4  05/31/1937  01/18/2007   04/30/2011
5       5  06/31/1933  05/16/2015     11/20/un

> lapply(TestDates[, -1], checkDate)
$birthDT
[1] "11/23/21931" "06/20/1840"  "06/31/1933"

$diagnosisDT
[1] "02/30/2010" "05/16/2015"

$metastaticDT
[1] "un/17/2011" "11/20/un"

Does that not do what you require, Paul?

Marc

<snip prior content>
Re: [R] Checking for invalid dates: Code works but needs improvement
On Jan 30, 2012, at 1:30 PM, Marc Schwartz wrote:

> <snip>
>
> If that is the basic functionality required, then a modest modification of my prior code should work:

Ack...typo in my code for the upper end of the date range. Should be:

checkDate <- function(x) {
  # Replace unknown day with 15
  tmp <- gsub("/un/", "/15/", x)

  tmp2 <- as.Date(tmp, format = "%m/%d/%Y")

  as.character(x[is.na(tmp2) |
                 tmp2 < as.Date("1900/01/01") |
                 tmp2 > as.Date("2011/12/31") |
                 nchar(as.character(x)) > 10])
}

Marc
Re: [R] Checking for invalid dates: Code works but needs improvement
Hello,

I'm glad it helped.

> Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning: Invalid date values in", :
>   missing value where TRUE/FALSE needed
>
> Why is this happening? If the code correctly handles the date "06/20/1840" without producing an error, why can't it do likewise with "05/16/2015"?

Because "un" is greater than 2012. The problem starts with the 'ifelse': it changes the Y values to NA whenever the condition is met, so an unknown year becomes NA and the later test Y != "un" returns NA instead of FALSE.

Correction: instead of ifelse, use

Known <- M != "un" & Y != "un"
Y[nchar(Y) > 4 | Y > 2012 | Y < 1900] <- NA

Now you need to place the conjunction of Known and is.na(x) in both the 'if' and 'cat' statements. (Only dates with known year and month but wrongly keyed in will be printed.)

if(any(Known & is.na(x)))
    cat("Warning: Invalid date values in", jj, "\n",
        as.character(DF[Known & is.na(x), jj]), "\n")

Like David said, 'any' expects comma-separated logical vectors, but in this case the conjunction agrees more with what is wanted. It's an adversative conjunction that reads: Known BUT is.na(x).

The point is that, according to your code, the unknown dates shouldn't be printed. They are just NA, not invalid.

Does this also answer the second question?

As for the third question, there are so few columns that I don't believe the for loop can hurt. There could be a problem in changing the class to Date using *apply, because R passes arguments to functions by value and only the copy inside the function would be changed. Considering the number of iterations in the loop, it's maybe simpler like this.

Rui Barradas
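The failure and the fix can be reproduced on a small scale. This is a sketch with hand-built vectors standing in for the month, year and parsed values of the metastaticDT column; the names Y2, bad_test and good_test are illustrative only.

```r
# Month, year and parsed date for five rows shaped like metastaticDT
M <- c("un", "03", "07", "04", "11")
Y <- c("2011", "2011", "2011", "2011", "un")
x <- as.Date(c(NA, "2011-03-17", "2011-07-15", "2011-04-30", NA))

# Record which rows are known BEFORE any year is overwritten with NA
Known <- M != "un" & Y != "un"

# As strings, "un" > "2012" is TRUE, so the unknown year is turned into NA
Y2 <- ifelse(Y > 2012, NA, Y)

bad_test  <- any(is.na(x) & M != "un" & Y2 != "un")  # NA: if() stops with an error
good_test <- any(Known & is.na(x))                   # FALSE: safe to use in if()

print(bad_test)
print(good_test)
```

any() returns NA when no element is TRUE and at least one is NA, and if() refuses an NA condition, which is the "missing value where TRUE/FALSE needed" message.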
[R] Checking for invalid dates: Code works but needs improvement
Sorry, sent this earlier but forgot to add an informative subject line. Am resending, in the hopes of getting further replies. My apologies. Hope this is OK.

Paul

Hi Rui,

Thanks for your reply to my post. My code still has various shortcomings but at least now it is fully functional.

It may be that, as I transition to using R, I'll have to live with some less than ideal code, at least at the outset. I'll just have to write and re-write my code as I improve.

Appreciate your help.

Paul

Message: 66
Date: Tue, 24 Jan 2012 09:54:57 -0800 (PST)
From: Rui Barradas ruipbarra...@sapo.pt
To: r-help@r-project.org
Subject: Re: [R] Checking for invalid dates: Code works but needs improvement
Message-ID: 1327427697928-4324533.p...@n4.nabble.com
Content-Type: text/plain; charset=us-ascii

Hello,

Point 3 is very simple: instead of 'print' use 'cat'. Unlike 'print' it allows for several arguments and (very) simple formatting.

{ cat("Error: Invalid date values in", DateNames[[i]], "\n",
      TestDates[DateNames][[i]][TestDates$Invalid==1], "\n") }

Rui Barradas

Message: 53
Date: Tue, 24 Jan 2012 08:54:49 -0800 (PST)
From: Paul Miller pjmiller...@yahoo.com
To: r-help@r-project.org
Subject: [R] Checking for invalid dates: Code works but needs improvement
Message-ID: 1327424089.1149.yahoomailclas...@web161604.mail.bf1.yahoo.com
Content-Type: text/plain; charset=us-ascii

Hello Everyone,

Still new to R. Wrote some code that finds and prints invalid dates (see below). This code works but I suspect it's not very good. If someone could show me a better way, I'd greatly appreciate it.

Here is some information about what I'm trying to accomplish. My sense is that the R date functions are best at identifying invalid dates when fed character data in their default format. So my code converts the input dates to character, breaks them apart using strsplit, and then reformats them. It then identifies which dates are "missing" in the sense that the month or year are unknown and prints out any remaining invalid date values.

As I see it, the code has at least 4 shortcomings.

1. It's too long. My understanding is that skilled programmers can usually or often complete tasks like this in a few lines.

2. It's not vectorized. I started out trying to do something that was vectorized but ran into problems with the strsplit function. I looked at the help file and it appears this function will only accept a single character vector.

3. It prints out the incorrect dates but doesn't indicate which date variable they belong to. I tried various things with paste but never came up with anything that worked. Ideally, I'd like to get something that looks roughly like:

Error: Invalid date values in birthDT
"21931-11-23" "1933-06-31"
Error: Invalid date values in diagnosisDT
"2010-02-30"

4. There's no way to specify names for input and output data. I imagine it would be fairly easy to specify this in the arguments to a function, but I am not sure how to incorporate it into a for loop.

Thanks,

Paul

#### Code for detecting invalid dates ####

#### Test Data ####

connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1940 02/30/2010 03/17/2011
3 06/17/1935 12/20/2008 07/un/2011
4 05/31/1937 01/18/2007 04/30/2011
5 06/31/1933 05/16/2009 11/20/un
")

TestDates <- data.frame(scan(connection, list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
close(connection)

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)

#### List of Date Variables ####

DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

#### Read Dates ####

for (i in seq(TestDates[DateNames])){
    TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
    TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]], "/")
    TestDates$Month <- sapply(TestDates$ParsedDT, function(x) x[1])
    TestDates$Day <- sapply(TestDates$ParsedDT, function(x) x[2])
    TestDates$Year <- sapply(TestDates$ParsedDT, function(x) x[3])
    TestDates$Day[TestDates$Day=="un"] <- "15"
    TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
    is.na( TestDates[DateNames][[i]] [TestDates$Month=="un"] ) <- TRUE
    is.na( TestDates[DateNames][[i]] [TestDates$Year=="un"] ) <- TRUE
    TestDates$Date <- as.Date(TestDates[DateNames][[i]], format="%Y-%m-%d")
    TestDates$Invalid <- ifelse(is.na(TestDates$Date) & !is.na(TestDates[DateNames][[i]]), 1, 0)
    if( sum(TestDates$Invalid)==0 )
        { TestDates[DateNames][[i]] <- TestDates$Date } else
        { print ( TestDates[DateNames][[i]][TestDates$Invalid==1]) }
    TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date, Invalid))
}

TestDates

class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)
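On shortcoming 2: strsplit() does take a whole character vector and returns a list with one element per input string, so the splitting step can be done without a loop. A small sketch using three of the test values, assuming every string has exactly three "/"-separated parts (the column names are illustrative):

```r
dates <- c("11/23/21931", "07/un/2011", "11/20/un")

# One list element per date; rbind stacks them into a 3-column character matrix
parts <- do.call(rbind, strsplit(dates, "/", fixed = TRUE))
colnames(parts) <- c("Month", "Day", "Year")

print(parts)
print(parts[, "Year"])
```

If some strings could have a different number of parts, the rbind step would recycle values silently, so the lengths should be checked first (for example with lengths(strsplit(...))).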
Re: [R] Checking for invalid dates: Code works but needs improvement
Paul,

I have a partial solution for you. It is partial in that I have not quite figured out the correct incantation to convert a 5-digit year (e.g. 11/23/21931) properly using the R date functions. According to various sources (e.g. man strptime and man strftime) as well as the R help for both functions, there are extended formats available, but I am having a bout of cerebral flatulence in getting them to work correctly and a search has not been fruitful. Perhaps someone else can offer some insights.

That being said, with the exception of correctly handling that one situation, which arguably IS a valid date a long time in the future and which would otherwise result in a truncated year (first four digits only):

> as.Date("11/23/21931", format = "%m/%d/%Y")
[1] "2193-11-23"

Here is one approach:

# Check the date. If as.Date() fails or the input is > 10 characters, return it
checkDate <- function(x) as.character(x[is.na(as.Date(x, format = "%m/%d/%Y")) |
                                        nchar(as.character(x)) > 10])

> lapply(TestDates[, -1], checkDate)
$birthDT
[1] "11/23/21931" "06/31/1933"

$diagnosisDT
[1] "02/30/2010"

$metastaticDT
[1] "un/17/2011" "07/un/2011" "11/20/un"

You could fine-tune the checkDate() function to handle other formats, etc.

HTH,

Marc Schwartz

On Jan 26, 2012, at 9:54 AM, Paul Miller wrote:

> <snip>
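Since the parser may quietly read only the first four digits of an over-long year, one option is to test the shape of each string before parsing it. The sketch below uses a regular expression as an alternative to the nchar() test in the post; the pattern and the name well_formed are assumptions made for illustration, and the pattern only covers the slash-separated m/d/Y layout with "un" as the unknown marker.

```r
s <- c("11/23/21931", "06/31/1933", "06/17/1935", "07/un/2011")

# Two digits (or "un") for month and day, then exactly four digits for the year
well_formed <- grepl("^([0-9]{2}|un)/([0-9]{2}|un)/[0-9]{4}$", s)

# Parsing alone cannot be trusted to reject the 5-digit year,
# so the shape test is applied alongside it
parsed <- as.Date(s, format = "%m/%d/%Y")

print(data.frame(s, well_formed, parsed))
```

The shape test flags 11/23/21931 without depending on how the date parser treats the extra digit, while impossible dates such as 06/31/1933 still need the is.na() check on the parsed value.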
Re: [R] Checking for invalid dates: Code works but needs improvement
On Tue, Jan 24, 2012 at 11:54 AM, Paul Miller pjmiller...@yahoo.com wrote:

> Hello Everyone,
>
> Still new to R. Wrote some code that finds and prints invalid dates (see below). This code works but I suspect it's not very good. If someone could show me a better way, I'd greatly appreciate it.
>
> <snip>

If s is a vector of character strings representing dates, then bad is a logical vector which is TRUE for the bad ones and FALSE for the good ones (adjust as needed if a different date range is valid), so s[bad] is the bad inputs and the output d is a "Date" vector with NAs for the bad ones:

x <- gsub("un", 15, s)
d <- as.Date(x, "%m/%d/%Y")
bad <- is.na(d) | d < as.Date("1900-01-01") | d > Sys.Date()
d[bad] <- NA

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
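The same steps can be wrapped in a function and applied to each date column with lapply. This is a sketch under assumptions: the wrapper name check, its lo and hi arguments, and the fixed upper limit of 2012-01-30 (used instead of Sys.Date() so the result is reproducible) are illustrative and not part of the reply.

```r
TestDates <- data.frame(
  birthDT     = c("11/23/21931", "06/20/1840", "06/17/1935"),
  diagnosisDT = c("05/23/2009", "02/30/2010", "05/16/2015"),
  stringsAsFactors = FALSE
)

# Return the raw strings that fail to parse or fall outside [lo, hi]
check <- function(s, lo = as.Date("1900-01-01"), hi = as.Date("2012-01-30")) {
  x <- gsub("un", "15", s)        # as in the reply: "un" is replaced with 15
  d <- as.Date(x, "%m/%d/%Y")
  bad <- is.na(d) | d < lo | d > hi
  s[bad]
}

bad_by_col <- lapply(TestDates, check)
print(bad_by_col)
```

Because gsub() replaces every "un", an unknown month or year also ends up flagged as bad here, which differs from the later requirement that such values be set to missing without being reported.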
Re: [R] Checking for invalid dates: Code works but needs improvement
Hello, again.

I now have a more complete answer to your points.

> 1. It's too long. My understanding is that skilled programmers can usually or often complete tasks like this in a few lines.

It's not much shorter, but it's more readable. (The programmer is always suspect.)

> 2. It's not vectorized. I started out trying to do something that was vectorized but ran into problems with the strsplit function. I looked at the help file and it appears this function will only accept a single character vector.

All but one of the instructions are vectorized, and the one that is not only loops over a few column names. Use 'unlist' on the output of 'strsplit' to get a vector.

> 4. There's no way to specify names for input and output data. I imagine this would be fairly easy to specify this in the arguments to a function but am not sure how to incorporate it into a for loop.

You can now specify any matrix or data.frame, but it will only process the columns with dates. (This is not quite true: it will process anything with a '/' in it. Pay attention.)

Near the beginning of your code include the following:

TestDates <- data.frame(scan(connection, list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
close(connection)
TDSaved <- TestDates  # to avoid reopening the connection

And then, after all of it,

fun <- function(Dat){
    f <- function(jj, DF){
        x <- as.character(DF[, jj])
        # split every date and flatten: month, day, year repeat in groups of 3
        x <- unlist(strsplit(x, "/"))
        n <- length(x)
        M <- x[seq(1, n, 3)]
        D <- x[seq(2, n, 3)]
        Y <- x[seq(3, n, 3)]
        D[D == "un"] <- "15"
        # Y is character here, so Y < 1900 is a string comparison
        Y <- ifelse(nchar(Y) > 4 | Y < 1900, NA, Y)
        x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
        if(any(is.na(x)))
            cat("Warning: Invalid date values in", jj, "\n",
                as.character(DF[is.na(x), jj]), "\n")
        x
    }
    colinx <- colnames(as.data.frame(Dat))
    Dat <- data.frame(sapply(colinx, function(j) f(j, Dat)))
    # sapply drops the Date class, so restore it column by column
    for(i in colinx) class(Dat[[i]]) <- "Date"
    Dat
}

TD <- TDSaved
TD[, DateNames] <- fun(TD[, DateNames])
TD

Had fun writing it. Good luck.
Rui Barradas
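The unlist/seq indexing in the function above assumes every date has exactly three components. A minimal sketch of the same split, under that same assumption, binds the pieces into a matrix so the month, day and year sit in named columns. It also converts the year to a number before comparing it with 1900, since comparing a character vector with a number in R compares strings. The variable names (`parts`, `year_num`) are illustrative and not from the thread.

```r
# metastaticDT values from the sample data posted in this thread
x <- c("un/17/2011", "03/17/2011", "07/un/2011", "04/30/2011", "11/20/un")

# strsplit is vectorized and returns a list; rbind turns it into a 5 x 3 matrix
parts <- do.call(rbind, strsplit(x, "/", fixed = TRUE))
colnames(parts) <- c("M", "D", "Y")

parts[parts[, "D"] == "un", "D"] <- "15"   # impute the 15th for an unknown day

# Convert the year to numeric first; "un" becomes NA (warning suppressed)
year_num <- suppressWarnings(as.numeric(parts[, "Y"]))

d <- as.Date(paste(parts[, "Y"], parts[, "M"], parts[, "D"], sep = "-"),
             format = "%Y-%m-%d")
d[is.na(year_num) | year_num < 1900] <- NA  # unknown or too-early years

d
```

Keeping the components in a matrix makes it harder to misalign months, days and years than stepping through a flattened vector by position.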
[R] Checking for invalid dates: Code works but needs improvement
Hello Everyone,

Still new to R. Wrote some code that finds and prints invalid dates (see below). This code works but I suspect it's not very good. If someone could show me a better way, I'd greatly appreciate it.

Here is some information about what I'm trying to accomplish. My sense is that the R date functions are best at identifying invalid dates when fed character data in their default format. So my code converts the input dates to character, breaks them apart using strsplit, and then reformats them. It then identifies which dates are missing in the sense that the month or year are unknown and prints out any remaining invalid date values.

As I see it, the code has at least 4 shortcomings.

1. It's too long. My understanding is that skilled programmers can usually or often complete tasks like this in a few lines.

2. It's not vectorized. I started out trying to do something that was vectorized but ran into problems with the strsplit function. I looked at the help file and it appears this function will only accept a single character vector.

3. It prints out the incorrect dates but doesn't indicate which date variable they belong to. I tried various things with paste but never came up with anything that worked. Ideally, I'd like to get something that looks roughly like:

Error: Invalid date values in birthDT
21931-11-23 1933-06-31
Error: Invalid date values in diagnosisDT
2010-02-30

4. There's no way to specify names for input and output data. I imagine this would be fairly easy to specify this in the arguments to a function but am not sure how to incorporate it into a for loop.
Thanks,

Paul

## Code for detecting invalid dates ##

## Test Data ##
connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1940 02/30/2010 03/17/2011
3 06/17/1935 12/20/2008 07/un/2011
4 05/31/1937 01/18/2007 04/30/2011
5 06/31/1933 05/16/2009 11/20/un
")
TestDates <- data.frame(scan(connection, list(Patient=0, birthDT="", diagnosisDT="", metastaticDT="")))
close(connection)

TestDates
class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)

## List of Date Variables ##
DateNames <- c("birthDT", "diagnosisDT", "metastaticDT")

## Read Dates ##
for (i in seq_along(DateNames)){
  TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
  # split each date into month, day and year
  TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]], "/")
  TestDates$Month <- sapply(TestDates$ParsedDT, function(x) x[1])
  TestDates$Day <- sapply(TestDates$ParsedDT, function(x) x[2])
  TestDates$Year <- sapply(TestDates$ParsedDT, function(x) x[3])
  # impute the 15th for an unknown day
  TestDates$Day[TestDates$Day == "un"] <- "15"
  TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
  # an unknown month or year makes the date missing, not invalid
  is.na( TestDates[DateNames][[i]] [TestDates$Month == "un"] ) <- TRUE
  is.na( TestDates[DateNames][[i]] [TestDates$Year == "un"] ) <- TRUE
  TestDates$Date <- as.Date(TestDates[DateNames][[i]], format = "%Y-%m-%d")
  # invalid = could not be converted although the input was not missing
  TestDates$Invalid <- ifelse(is.na(TestDates$Date) & !is.na(TestDates[DateNames][[i]]), 1, 0)
  if( sum(TestDates$Invalid) == 0 ) {
    TestDates[DateNames][[i]] <- TestDates$Date
  } else {
    print( TestDates[DateNames][[i]][TestDates$Invalid == 1] )
  }
  TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date, Invalid))
}

TestDates
class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)
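On shortcoming 4, one way to combine a function with a for loop is to loop over the column names inside the function and return the converted copy. The sketch below is only an illustration under stated assumptions: the function name `check_dates` and its arguments (`indata`, `dates`, `datefmt`, `mindt`, `maxdt`) are hypothetical, and the range check is an added assumption rather than part of the code posted above.

```r
# Hypothetical sketch: function name and arguments are illustrative only.
check_dates <- function(indata, dates, datefmt = "%m/%d/%Y",
                        mindt = as.Date("1900-01-01"), maxdt = Sys.Date()) {
  outdata <- indata
  for (nm in dates) {                              # loop over column names
    orig <- as.character(indata[[nm]])
    s <- sub("/un/", "/15/", orig, fixed = TRUE)   # unknown day -> 15th
    unknown <- grepl("un", s, fixed = TRUE)        # unknown month or year
    d <- as.Date(s, format = datefmt)
    # invalid = not merely unknown, and unparseable or outside the range
    invalid <- !unknown & (is.na(d) | d < mindt | d > maxdt)
    d[unknown | invalid] <- NA
    if (any(invalid))
      cat("Invalid date values in", nm, ":", orig[invalid], "\n")
    outdata[[nm]] <- d
  }
  outdata
}

TestDates <- data.frame(
  Patient      = 1:5,
  birthDT      = c("11/23/21931", "06/20/1940", "06/17/1935", "05/31/1937", "06/31/1933"),
  diagnosisDT  = c("05/23/2009", "02/30/2010", "12/20/2008", "01/18/2007", "05/16/2009"),
  metastaticDT = c("un/17/2011", "03/17/2011", "07/un/2011", "04/30/2011", "11/20/un"),
  stringsAsFactors = FALSE
)

TestDates2 <- check_dates(TestDates, c("birthDT", "diagnosisDT", "metastaticDT"))
TestDates2
```

The input and output names are then chosen by the caller: whatever is passed in is left untouched, and the result is assigned to any name you like.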
Re: [R] Checking for invalid dates: Code works but needs improvement
Hello,

Point 3 is very simple: instead of 'print' use 'cat'. Unlike 'print', it accepts several arguments and allows (very) simple formatting.

{ cat("Error: Invalid date values in", DateNames[[i]], "\n",
      TestDates[DateNames][[i]][TestDates$Invalid==1], "\n") }

Rui Barradas
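To see the effect outside the loop, here is a small sketch of the same idea. The helper name `report_invalid` is hypothetical, and the values are the ones from the desired output quoted in the original post. It uses two cat calls because cat puts a space between its arguments, which would otherwise leave a leading space on the second line.

```r
# Hypothetical helper: prints a header naming the variable, then its bad values.
report_invalid <- function(name, values) {
  cat("Error: Invalid date values in", name, "\n")
  cat(values, "\n")
}

report_invalid("birthDT", c("21931-11-23", "1933-06-31"))
report_invalid("diagnosisDT", "2010-02-30")
```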