Re: [R] Need a faster function to replace missing data
Many thanks to Jim, Bill, and Carl. Using indexes instead of the for loop gave me my answer in minutes instead of hours! Thanks for all of your great suggestions! Aloha, Tim Tim Clark Department of Zoology University of Hawaii --- On Fri, 5/22/09, jim holtman jholt...@gmail.com wrote: From: jim holtman jholt...@gmail.com Subject: Re: [R] Need a faster function to replace missing data To: Tim Clark mudiver1...@yahoo.com Cc: r-help@r-project.org Date: Friday, May 22, 2009, 4:59 PM Here is a modification that should now find the closest: myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00), + format=%H:%M:%S)) # convert to numeric names(myvscan)-c(Latitude,DateTime) myvscan$tn - as.numeric(myvscan$DateTime) # numeric for findInterval mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00), + format=%H:%M:%S)) names(mygarmin)-c(Latitude,DateTime) mygarmin$tn - as.numeric(mygarmin$DateTime) # use 'findInterval' na.indx - which(is.na(myvscan$Latitude)) # find NAs # create matrix of values to test the range indices - findInterval(myvscan$tn[na.indx],mygarmin$tn) x - cbind(indices, + abs(myvscan$tn[na.indx] - mygarmin$tn[indices]), # lower + abs(myvscan$tn[na.indx] - mygarmin$tn[indices + 1])) #higher # now determine which index is closer closest - x[,1] + (x[,2] x[,3]) # determine the proper index # replace with garmin latitude myvscan$Latitude[na.indx] - mygarmin$Latitude[closest] myvscan Latitude DateTime tn 1 1.0 2009-05-23 12:00:00 124308 2 40.0 2009-05-23 12:14:00 1243080840 3 1.5 2009-05-23 12:20:00 1243081200 On Fri, May 22, 2009 at 7:39 PM, Tim Clark mudiver1...@yahoo.com wrote: Jim, Thanks! I like the way you use indexing instead of the loops. However, the find.Interval function does not give the right result. I have been playing with it and it seems to give the closest number that is less than the one of interest. In this case, the correct replacement should have been 40, not 30, since 12:15 from mygarmin is closer to 12:14 in myvscan than 12:10. Is there a way to get the function to find the closest in value instead of the next smaller value? I was trying to use which.min to get the closet date but can't seem to get it to work right either. Aloha, Tim Tim Clark Department of Zoology University of Hawaii --- On Fri, 5/22/09, jim holtman jholt...@gmail.com wrote: From: jim holtman jholt...@gmail.com Subject: Re: [R] Need a faster function to replace missing data To: Tim Clark mudiver1...@yahoo.com Cc: r-help@r-project.org Date: Friday, May 22, 2009, 7:24 AM I think this does what you want. It uses 'findInterval' to determine where a possible match is: myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00), format=%H:%M:%S)) # convert to numeric names(myvscan)-c(Latitude,DateTime) myvscan$tn - as.numeric(myvscan$DateTime) # numeric for findInterval mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00), format=%H:%M:%S)) names(mygarmin)-c(Latitude,DateTime) mygarmin$tn - as.numeric(mygarmin$DateTime) # use 'findInterval' na.indx - which(is.na(myvscan$Latitude)) # find NAs # replace with garmin latitude myvscan$Latitude[na.indx] - mygarmin$Latitude[findInterval(myvscan$tn[na.indx], mygarmin$tn)] myvscan Latitude DateTime tn 1 1.0 2009-05-22 12:00:00 1243008000 2 30.0 2009-05-22 12:14:00 1243008840 3 1.5 2009-05-22 12:20:00 1243009200 On Fri, May 22, 2009 at 12:45 AM, Tim Clark mudiver1...@yahoo.com wrote: Dear List, I need some help in coming up with a function that will take two data sets, determine if a value is missing in one, find a value in the second that was taken at about the same time, and substitute the second value in for where the first should have been. My problem is from a fish tracking study. We put acoustic tags in fish and track them for several days. Location data is supposed to be automatically recorded every time we detect a ping from the fish. Unfortunately the GPS had some problems and sometimes the fishes depth was recorded but not its location. I fortunately had a back-up GPS that was taking location data every five minutes. I would like to merge the two files, replacing the missing value in the vscan (automatic) file with the location from the garmin file. Since we were getting vscan records every 1-2 seconds and garmin records every 5 minutes, I need to find the right place in the vscan file to place the garmin record - i.e. the closest in time, but not greater than 5 minutes. I have written a function
Re: [R] Need a faster function to replace missing data
Tim Clark mudiver1200 at yahoo.com writes: I need some help in coming up with a function that will take two data sets, determine if a value is missing in one, find a value in the second that was taken at about the same time, and substitute the second value in for where the first should have been. This is the type of job I would do with a database, not R (alone). The main advantage is that you have to do the cleanup job only once and can retrieve the data in a rather well-documented way later (it's possible with R, I know). Put the 5 minutes data into one table. I would two new columns giving the delta to the next value for easier linear interpolation, but that's secondary. Make sure to index the table. Put the 1 seconds data into another table, adding values rounded to 5 seconds, and giving these an index. From R/ODBC or with RSQLite, make a Join of all values in Table 1 that do have NULL values in the coordinates. If you do not want to do a linear interpolation, you could even do this within the database and SQL alone. Compute the linear interpolation, and write the data back into the database. If you want to be careful, you might mark the interpolated values in a separate field as computed When at a later time new data come in, you can run the procedure again without penalty. Dieter __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Need a faster function to replace missing data
Here are 2 functions, which.just.above and which.just.below, which may help you. They will tell which element in a reference dataset is the first just above (or just below) each element in the main dataset (x). They return NA if there is no reference element above (or below) an element of x. The strict argument lets you say if the inequalities are strict or if equality is acceptable. They are vectorized so are pretty quick. E.g., which.just.below(c(14,14.5), 11:15, strict=TRUE) [1] 3 4 which.just.above(c(14,14.5), 11:15, strict=FALSE) [1] 4 5 They should work with any class of data that order() and sort() work on. In particular, POSIXct times work. The attached file has a demonstration function called 'test' with some examples. In your case the 'reference' data would be the times at which your backup measurements were taken and the 'x' data would be the times of the pings. You can look at the elements of 'reference' just before and just after each ping (or just the pings that are missing locations) and decide how to combine the data from the bracketing reference elements to inpute a location for the ping. Here are the functions, in case the attachment doesn't make it through. I'm sure some mailer will throw in some newlines so it will be corrupted. which.just.above - function(x, reference, strict = T) { # output[k] will be index of smallest value in reference vector # larger than x[k]. If strict=F, replace 'larger than' by # 'larger than or equal to'. # We should allow NA's in x (but we don't). NA's in reference # should not be allowed. if(any(is.na(x)) || any(is.na(reference))) stop(NA's in input) if(strict) i - c(rep(T, length(reference)), rep(F, length(x)))[order( c(reference, x))] else i - c(rep(F, length(x)), rep(T, length(reference)))[order(c( x, reference))] i - cumsum(i)[!i] + 1. i[i length(reference)] - NA # i is length of x and has values in range 1:length(reference) or NA # following needed if reference is not sorted i - order(reference)[i] # following needed if x is not sorted i[order(order(x))] } which.just.below - function(x, reference, strict = T) { # output[k] will be index of largest value in reference vector # less than x[k]. If strict=F, replace 'less than' by # 'less than or equal to'. Neither x nor reference need be # sorted, although they should not have NA's (in theory, NA's # in x are ok, but not in reference). if(any(is.na(x)) || any(is.na(reference))) stop(NA's in input) if(!strict) i - c(rep(T, length(reference)), rep(F, length(x)))[order( c(reference, x))] else i - c(rep(F, length(x)), rep(T, length(reference)))[order(c( x, reference))] i - cumsum(i)[!i] i[i = 0] - NA # i is length of x and has values in range 1:length(reference) or NA # following needed if reference is not sorted i - order(reference)[i] # following needed if x is not sorted i[order(order(x))] } Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Tim Clark Sent: Thursday, May 21, 2009 9:45 PM To: r-help@r-project.org Subject: [R] Need a faster function to replace missing data Dear List, I need some help in coming up with a function that will take two data sets, determine if a value is missing in one, find a value in the second that was taken at about the same time, and substitute the second value in for where the first should have been. My problem is from a fish tracking study. We put acoustic tags in fish and track them for several days. Location data is supposed to be automatically recorded every time we detect a ping from the fish. Unfortunately the GPS had some problems and sometimes the fishes depth was recorded but not its location. I fortunately had a back-up GPS that was taking location data every five minutes. I would like to merge the two files, replacing the missing value in the vscan (automatic) file with the location from the garmin file. Since we were getting vscan records every 1-2 seconds and garmin records every 5 minutes, I need to find the right place in the vscan file to place the garmin record - i.e. the closest in time, but not greater than 5 minutes. I have written a function that does this. However, it works with my test data but locks up my computer with my real data. I have several million vscan records and several thousand garmin records. Is there a better way to do this? My function and test data: myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00, 12:20:00))) names(myvscan)-c
Re: [R] Need a faster function to replace missing data
I think this does what you want. It uses 'findInterval' to determine where a possible match is: myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00), format=%H:%M:%S)) # convert to numeric names(myvscan)-c(Latitude,DateTime) myvscan$tn - as.numeric(myvscan$DateTime) # numeric for findInterval mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00), format=%H:%M:%S)) names(mygarmin)-c(Latitude,DateTime) mygarmin$tn - as.numeric(mygarmin$DateTime) # use 'findInterval' na.indx - which(is.na(myvscan$Latitude)) # find NAs # replace with garmin latitude myvscan$Latitude[na.indx] - mygarmin$Latitude[findInterval(myvscan$tn[na.indx], mygarmin$tn)] myvscan LatitudeDateTime tn 1 1.0 2009-05-22 12:00:00 1243008000 2 30.0 2009-05-22 12:14:00 1243008840 3 1.5 2009-05-22 12:20:00 1243009200 On Fri, May 22, 2009 at 12:45 AM, Tim Clark mudiver1...@yahoo.com wrote: Dear List, I need some help in coming up with a function that will take two data sets, determine if a value is missing in one, find a value in the second that was taken at about the same time, and substitute the second value in for where the first should have been. My problem is from a fish tracking study. We put acoustic tags in fish and track them for several days. Location data is supposed to be automatically recorded every time we detect a ping from the fish. Unfortunately the GPS had some problems and sometimes the fishes depth was recorded but not its location. I fortunately had a back-up GPS that was taking location data every five minutes. I would like to merge the two files, replacing the missing value in the vscan (automatic) file with the location from the garmin file. Since we were getting vscan records every 1-2 seconds and garmin records every 5 minutes, I need to find the right place in the vscan file to place the garmin record - i.e. the closest in time, but not greater than 5 minutes. I have written a function that does this. However, it works with my test data but locks up my computer with my real data. I have several million vscan records and several thousand garmin records. Is there a better way to do this? My function and test data: myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,12:20:00))) names(myvscan)-c(Latitude,DateTime) mygarmin-data.frame(c(20,30,40),times((12:00:00,12:10:00,12:15:00))) names(mygarmin)-c(Latitude,DateTime) minute.diff-1/24/12 #Time diff is in days, so this is 5 minutes for (k in 1:nrow(myvscan)) { if (is.na(myvscan$Latitude[k])) { if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) minute.diff ) { index.min.date-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k])) myvscan$Latitude[k]-mygarmin$Latitude[index.min.date] }}} I appreciate your help and advice. Aloha, Tim Tim Clark Department of Zoology University of Hawaii __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.htmlhttp://www.r-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Need a faster function to replace missing data
Jim, Thanks! I like the way you use indexing instead of the loops. However, the find.Interval function does not give the right result. I have been playing with it and it seems to give the closest number that is less than the one of interest. In this case, the correct replacement should have been 40, not 30, since 12:15 from mygarmin is closer to 12:14 in myvscan than 12:10. Is there a way to get the function to find the closest in value instead of the next smaller value? I was trying to use which.min to get the closet date but can't seem to get it to work right either. Aloha, Tim Tim Clark Department of Zoology University of Hawaii --- On Fri, 5/22/09, jim holtman jholt...@gmail.com wrote: From: jim holtman jholt...@gmail.com Subject: Re: [R] Need a faster function to replace missing data To: Tim Clark mudiver1...@yahoo.com Cc: r-help@r-project.org Date: Friday, May 22, 2009, 7:24 AM I think this does what you want. It uses 'findInterval' to determine where a possible match is: myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00), format=%H:%M:%S)) # convert to numeric names(myvscan)-c(Latitude,DateTime) myvscan$tn - as.numeric(myvscan$DateTime) # numeric for findInterval mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00), format=%H:%M:%S)) names(mygarmin)-c(Latitude,DateTime) mygarmin$tn - as.numeric(mygarmin$DateTime) # use 'findInterval' na.indx - which(is.na(myvscan$Latitude)) # find NAs # replace with garmin latitude myvscan$Latitude[na.indx] - mygarmin$Latitude[findInterval(myvscan$tn[na.indx], mygarmin$tn)] myvscan Latitude DateTime tn 1 1.0 2009-05-22 12:00:00 1243008000 2 30.0 2009-05-22 12:14:00 1243008840 3 1.5 2009-05-22 12:20:00 1243009200 On Fri, May 22, 2009 at 12:45 AM, Tim Clark mudiver1...@yahoo.com wrote: Dear List, I need some help in coming up with a function that will take two data sets, determine if a value is missing in one, find a value in the second that was taken at about the same time, and substitute the second value in for where the first should have been. My problem is from a fish tracking study. We put acoustic tags in fish and track them for several days. Location data is supposed to be automatically recorded every time we detect a ping from the fish. Unfortunately the GPS had some problems and sometimes the fishes depth was recorded but not its location. I fortunately had a back-up GPS that was taking location data every five minutes. I would like to merge the two files, replacing the missing value in the vscan (automatic) file with the location from the garmin file. Since we were getting vscan records every 1-2 seconds and garmin records every 5 minutes, I need to find the right place in the vscan file to place the garmin record - i.e. the closest in time, but not greater than 5 minutes. I have written a function that does this. However, it works with my test data but locks up my computer with my real data. I have several million vscan records and several thousand garmin records. Is there a better way to do this? My function and test data: myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,12:20:00))) names(myvscan)-c(Latitude,DateTime) mygarmin-data.frame(c(20,30,40),times((12:00:00,12:10:00,12:15:00))) names(mygarmin)-c(Latitude,DateTime) minute.diff-1/24/12 #Time diff is in days, so this is 5 minutes for (k in 1:nrow(myvscan)) { if (is.na(myvscan$Latitude[k])) { if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) minute.diff ) { index.min.date-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k])) myvscan$Latitude[k]-mygarmin$Latitude[index.min.date] }}} I appreciate your help and advice. Aloha, Tim Tim Clark Department of Zoology University of Hawaii __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Need a faster function to replace missing data
Here is a modification that should now find the closest: myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00), + format=%H:%M:%S)) # convert to numeric names(myvscan)-c(Latitude,DateTime) myvscan$tn - as.numeric(myvscan$DateTime) # numeric for findInterval mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00), + format=%H:%M:%S)) names(mygarmin)-c(Latitude,DateTime) mygarmin$tn - as.numeric(mygarmin$DateTime) # use 'findInterval' na.indx - which(is.na(myvscan$Latitude)) # find NAs # create matrix of values to test the range indices - findInterval(myvscan$tn[na.indx],mygarmin$tn) x - cbind(indices, +abs(myvscan$tn[na.indx] - mygarmin$tn[indices]), # lower +abs(myvscan$tn[na.indx] - mygarmin$tn[indices + 1])) #higher # now determine which index is closer closest - x[,1] + (x[,2] x[,3]) # determine the proper index # replace with garmin latitude myvscan$Latitude[na.indx] - mygarmin$Latitude[closest] myvscan LatitudeDateTime tn 1 1.0 2009-05-23 12:00:00 124308 2 40.0 2009-05-23 12:14:00 1243080840 3 1.5 2009-05-23 12:20:00 1243081200 On Fri, May 22, 2009 at 7:39 PM, Tim Clark mudiver1...@yahoo.com wrote: Jim, Thanks! I like the way you use indexing instead of the loops. However, the find.Interval function does not give the right result. I have been playing with it and it seems to give the closest number that is less than the one of interest. In this case, the correct replacement should have been 40, not 30, since 12:15 from mygarmin is closer to 12:14 in myvscan than 12:10. Is there a way to get the function to find the closest in value instead of the next smaller value? I was trying to use which.min to get the closet date but can't seem to get it to work right either. Aloha, Tim Tim Clark Department of Zoology University of Hawaii --- On Fri, 5/22/09, jim holtman jholt...@gmail.com wrote: From: jim holtman jholt...@gmail.com Subject: Re: [R] Need a faster function to replace missing data To: Tim Clark mudiver1...@yahoo.com Cc: r-help@r-project.org Date: Friday, May 22, 2009, 7:24 AM I think this does what you want. It uses 'findInterval' to determine where a possible match is: myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00), format=%H:%M:%S)) # convert to numeric names(myvscan)-c(Latitude,DateTime) myvscan$tn - as.numeric(myvscan$DateTime) # numeric for findInterval mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00), format=%H:%M:%S)) names(mygarmin)-c(Latitude,DateTime) mygarmin$tn - as.numeric(mygarmin$DateTime) # use 'findInterval' na.indx - which(is.na(myvscan$Latitude)) # find NAs # replace with garmin latitude myvscan$Latitude[na.indx] - mygarmin$Latitude[findInterval(myvscan$tn[na.indx], mygarmin$tn)] myvscan LatitudeDateTime tn 1 1.0 2009-05-22 12:00:00 1243008000 2 30.0 2009-05-22 12:14:00 1243008840 3 1.5 2009-05-22 12:20:00 1243009200 On Fri, May 22, 2009 at 12:45 AM, Tim Clark mudiver1...@yahoo.com wrote: Dear List, I need some help in coming up with a function that will take two data sets, determine if a value is missing in one, find a value in the second that was taken at about the same time, and substitute the second value in for where the first should have been. My problem is from a fish tracking study. We put acoustic tags in fish and track them for several days. Location data is supposed to be automatically recorded every time we detect a ping from the fish. Unfortunately the GPS had some problems and sometimes the fishes depth was recorded but not its location. I fortunately had a back-up GPS that was taking location data every five minutes. I would like to merge the two files, replacing the missing value in the vscan (automatic) file with the location from the garmin file. Since we were getting vscan records every 1-2 seconds and garmin records every 5 minutes, I need to find the right place in the vscan file to place the garmin record - i.e. the closest in time, but not greater than 5 minutes. I have written a function that does this. However, it works with my test data but locks up my computer with my real data. I have several million vscan records and several thousand garmin records. Is there a better way to do this? My function and test data: myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,12:20:00))) names(myvscan)-c(Latitude,DateTime) mygarmin-data.frame(c(20,30,40),times((12:00:00,12:10:00,12:15:00))) names(mygarmin)-c(Latitude,DateTime) minute.diff-1/24/12 #Time diff is in days, so this is 5 minutes for (k in 1:nrow(myvscan)) { if (is.na(myvscan$Latitude[k])) { if ((min(abs(mygarmin$DateTime-myvscan
[R] Need a faster function to replace missing data
Dear List, I need some help in coming up with a function that will take two data sets, determine if a value is missing in one, find a value in the second that was taken at about the same time, and substitute the second value in for where the first should have been. My problem is from a fish tracking study. We put acoustic tags in fish and track them for several days. Location data is supposed to be automatically recorded every time we detect a ping from the fish. Unfortunately the GPS had some problems and sometimes the fishes depth was recorded but not its location. I fortunately had a back-up GPS that was taking location data every five minutes. I would like to merge the two files, replacing the missing value in the vscan (automatic) file with the location from the garmin file. Since we were getting vscan records every 1-2 seconds and garmin records every 5 minutes, I need to find the right place in the vscan file to place the garmin record - i.e. the closest in time, but not greater than 5 minutes. I have written a function that does this. However, it works with my test data but locks up my computer with my real data. I have several million vscan records and several thousand garmin records. Is there a better way to do this? My function and test data: myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,12:20:00))) names(myvscan)-c(Latitude,DateTime) mygarmin-data.frame(c(20,30,40),times((12:00:00,12:10:00,12:15:00))) names(mygarmin)-c(Latitude,DateTime) minute.diff-1/24/12 #Time diff is in days, so this is 5 minutes for (k in 1:nrow(myvscan)) { if (is.na(myvscan$Latitude[k])) { if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) minute.diff ) { index.min.date-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k])) myvscan$Latitude[k]-mygarmin$Latitude[index.min.date] }}} I appreciate your help and advice. Aloha, Tim Tim Clark Department of Zoology University of Hawaii __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.