Re: [R] Need a faster function to replace missing data

2009-05-26 Thread Tim Clark

Many thanks to Jim, Bill, and Carl.  Using indexes instead of the for loop gave 
me my answer in minutes instead of hours!  Thanks for all of your great 
suggestions!

Aloha,

Tim

Tim Clark
Department of Zoology 
University of Hawaii


--- On Fri, 5/22/09, jim holtman jholt...@gmail.com wrote:

 From: jim holtman jholt...@gmail.com
 Subject: Re: [R] Need a faster function to replace missing data
 To: Tim Clark mudiver1...@yahoo.com
 Cc: r-help@r-project.org
 Date: Friday, May 22, 2009, 4:59 PM
 Here is a modification that should
 now find the closest:
  
 
 myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00),
 + format=%H:%M:%S))
  # convert to numeric
  
 
 names(myvscan)-c(Latitude,DateTime)
 
  
  myvscan$tn - as.numeric(myvscan$DateTime)  #
 numeric for findInterval
  
 
 mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00),
 
 + format=%H:%M:%S))
  
  
 
 names(mygarmin)-c(Latitude,DateTime)
  mygarmin$tn - as.numeric(mygarmin$DateTime)
  
  # use 'findInterval'
 
  na.indx - which(is.na(myvscan$Latitude))  # find
 NAs
  
  # create matrix of values to test the range
  indices -
 findInterval(myvscan$tn[na.indx],mygarmin$tn)
 
  x - cbind(indices,
 +    abs(myvscan$tn[na.indx] -
 mygarmin$tn[indices]), # lower
 +    abs(myvscan$tn[na.indx] -
 mygarmin$tn[indices + 1]))  #higher
  # now determine which index is closer
 
  closest - x[,1] + (x[,2]  x[,3])  # determine
 the proper index
  # replace with garmin latitude
  myvscan$Latitude[na.indx] -
 mygarmin$Latitude[closest]
  
  
  
  myvscan
 
   Latitude    DateTime
 tn
 1  1.0 2009-05-23 12:00:00 124308
 2 40.0 2009-05-23 12:14:00 1243080840
 3  1.5 2009-05-23 12:20:00 1243081200
  
 
 
 
 On Fri, May 22, 2009 at 7:39 PM,
 Tim Clark mudiver1...@yahoo.com
 wrote:
 
 
 Jim,
 
 Thanks!  I like the way you use indexing instead of the
 loops.  However, the find.Interval function does not give
 the right result.  I have been playing with it and it seems
 to give the closest number that is less than the one of
 interest.  In this case, the correct replacement should
 have been 40, not 30, since 12:15 from mygarmin is closer to
 12:14 in myvscan than 12:10.  Is there a way to get the
 function to find the closest in value instead of the next
 smaller value?  I was trying to use which.min to get the
 closet date but can't seem to get it to work right
 either.
 
 
 
 Aloha,
 
 Tim
 
 
 Tim Clark
 Department of Zoology
 University of Hawaii
 
 
 --- On Fri, 5/22/09, jim holtman jholt...@gmail.com
 wrote:
 
 
  From: jim holtman jholt...@gmail.com
  Subject: Re: [R] Need a faster function to replace
 missing data
  To: Tim Clark mudiver1...@yahoo.com
 
  Cc: r-help@r-project.org
  Date: Friday, May 22, 2009, 7:24 AM
 
 
 
  I think this does what you
  want.  It uses 'findInterval' to determine
 where a
  possible match is:
   
  
 
 myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00),
 
  format=%H:%M:%S))
   # convert to numeric
  
 
 names(myvscan)-c(Latitude,DateTime)
 
   myvscan$tn - as.numeric(myvscan$DateTime) 
 #
 
  numeric for findInterval
  
 
 mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00),
  format=%H:%M:%S))
 
  
 
 
 names(mygarmin)-c(Latitude,DateTime)
   mygarmin$tn - as.numeric(mygarmin$DateTime)
  
   # use 'findInterval'
   na.indx - which(is.na(myvscan$Latitude))  # find
 
  NAs
 
   # replace with garmin latitude
   myvscan$Latitude[na.indx] -
  mygarmin$Latitude[findInterval(myvscan$tn[na.indx],
  mygarmin$tn)]
  
  
   myvscan
 
    Latitude   
 DateTime
  tn
 
  1  1.0 2009-05-22 12:00:00 1243008000
  2 30.0 2009-05-22 12:14:00 1243008840
  3  1.5 2009-05-22 12:20:00 1243009200
  
 
 
 
 
  On Fri, May 22, 2009 at 12:45 AM,
  Tim Clark mudiver1...@yahoo.com
  wrote:
 
 
  Dear List,
 
  I need some help in coming up with a function that
 will
 
  take two data sets, determine if a value is missing in
 one,
  find a value in the second that was taken at about the
 same
  time, and substitute the second value in for where the
 first
  should have been.  My problem is from a fish
 tracking
 
  study.  We put acoustic tags in fish and track them
 for
  several days.  Location data is supposed to be
  automatically recorded every time we detect a
  ping from the fish.  Unfortunately the
 GPS had
 
  some problems and sometimes the fishes depth was
 recorded
  but not its location.  I fortunately had a back-up
 GPS that
  was taking location data every five minutes.  I would
 like
  to merge the two files, replacing the missing value in
 the
 
  vscan (automatic) file with the location from the
 garmin
  file.  Since we were getting vscan records every 1-2
  seconds and garmin records every 5 minutes, I need to
 find
  the right place in the vscan file to place the garmin
 record
 
  - i.e. the
 
   closest in time, but not greater than 5 minutes.  I
 have
  written a function

Re: [R] Need a faster function to replace missing data

2009-05-22 Thread Dieter Menneq
Tim Clark mudiver1200 at yahoo.com writes:

 
 I need some help in coming up with a function that will take two data sets,
determine if a value is missing in
 one, find a value in the second that was taken at about the same time, and
substitute the second value in for
 where the first should have been.  

This is the type of job I would do with a database, not R (alone). The
main advantage is that you have to do the cleanup job only once and can
retrieve the data in a rather well-documented way later (it's possible
with R, I know).

 Put the 5 minutes data into one table. I would two new columns giving the 
delta to the next value for easier linear interpolation, but that's 
secondary. Make sure to index the table.

 Put the 1 seconds data into another table, adding values rounded to
5 seconds, and giving these an index.

From R/ODBC or with RSQLite, make a Join of all values in Table 1 
that do have NULL values in the coordinates. If you do not want to 
do a linear interpolation, you could even do this within the database
and SQL alone. 

 Compute the linear interpolation, and write the data back into
the database. If you want to be careful, you might mark the interpolated
values in a separate field as computed

When at a later time new data come in, you can run the procedure again
without penalty.

Dieter

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Need a faster function to replace missing data

2009-05-22 Thread William Dunlap
Here are 2 functions, which.just.above and which.just.below,
which may help you.  They will tell which element in a reference
dataset is the first just above (or just below) each element
in the main dataset (x).  They return NA if there is no reference
element above (or below) an element of x.  The strict argument
lets you say if the inequalities are strict or if equality is
acceptable.
They are vectorized so are pretty quick.

E.g.,
which.just.below(c(14,14.5), 11:15, strict=TRUE)
   [1] 3 4
which.just.above(c(14,14.5), 11:15, strict=FALSE)
   [1] 4 5
They should work with any class of data that order() and sort()
work on.  In particular, POSIXct times work.  The attached file
has a demonstration function called 'test' with some examples.

In your case the 'reference' data would be the times at which your
backup measurements were taken and the 'x' data would be the
times of the pings.  You can look at the elements of 'reference' just
before and just after each ping (or just the pings that are missing
locations) and decide how to combine the data from the bracketing
reference elements to inpute a location for the ping.

Here are the functions, in case the attachment doesn't make it
through.  I'm sure some mailer will throw in some newlines so
it will be corrupted.

which.just.above -
function(x, reference, strict = T)
{
# output[k] will be index of smallest value in reference vector
# larger than x[k].  If strict=F, replace 'larger than' by
# 'larger than or equal to'.
# We should allow NA's in x (but we don't). NA's in reference
# should not be allowed.
if(any(is.na(x)) || any(is.na(reference))) stop(NA's in input)
if(strict)
i - c(rep(T, length(reference)), rep(F,
length(x)))[order(
c(reference, x))]
else i - c(rep(F, length(x)), rep(T,
length(reference)))[order(c(
x, reference))]
i - cumsum(i)[!i] + 1.
i[i  length(reference)] - NA
# i is length of x and has values in range 1:length(reference)
or NA
# following needed if reference is not sorted
i - order(reference)[i]
# following needed if x is not sorted
i[order(order(x))]
}

which.just.below -
function(x, reference, strict = T)
{
# output[k] will be index of largest value in reference vector
# less than x[k].  If strict=F, replace 'less than' by
# 'less than or equal to'.  Neither x nor reference need be
# sorted, although they should not have NA's (in theory, NA's
# in x are ok, but not in reference).
if(any(is.na(x)) || any(is.na(reference))) stop(NA's in input)
if(!strict)
i - c(rep(T, length(reference)), rep(F,
length(x)))[order(
c(reference, x))]
else i - c(rep(F, length(x)), rep(T,
length(reference)))[order(c(
x, reference))]
i - cumsum(i)[!i]
i[i = 0] - NA
# i is length of x and has values in range 1:length(reference)
or NA
# following needed if reference is not sorted
i - order(reference)[i]
# following needed if x is not sorted
i[order(order(x))]
}

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Tim Clark
 Sent: Thursday, May 21, 2009 9:45 PM
 To: r-help@r-project.org
 Subject: [R] Need a faster function to replace missing data
 
 
 Dear List,
 
 I need some help in coming up with a function that will take 
 two data sets, determine if a value is missing in one, find a 
 value in the second that was taken at about the same time, 
 and substitute the second value in for where the first should 
 have been.  My problem is from a fish tracking study.  We put 
 acoustic tags in fish and track them for several days.  
 Location data is supposed to be automatically recorded every 
 time we detect a ping from the fish.  Unfortunately the GPS 
 had some problems and sometimes the fishes depth was recorded 
 but not its location.  I fortunately had a back-up GPS that 
 was taking location data every five minutes.  I would like to 
 merge the two files, replacing the missing value in the vscan 
 (automatic) file with the location from the garmin file.  
 Since we were getting vscan records every 1-2 seconds and 
 garmin records every 5 minutes, I need to find the right 
 place in the vscan file to place the garmin record - i.e. the
  closest in time, but not greater than 5 minutes.  I have 
 written a function that does this. However, it works with my 
 test data but locks up my computer with my real data.  I have 
 several million vscan records and several thousand garmin 
 records.  Is there a better way to do this?
 
 
 My function and test data:
 
 myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,
 12:20:00)))
 names(myvscan)-c

Re: [R] Need a faster function to replace missing data

2009-05-22 Thread jim holtman
I think this does what you want.  It uses 'findInterval' to determine where
a possible match is:


myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00),
format=%H:%M:%S))
 # convert to numeric
 names(myvscan)-c(Latitude,DateTime)
 myvscan$tn - as.numeric(myvscan$DateTime)  # numeric for findInterval

mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00),
format=%H:%M:%S))
 names(mygarmin)-c(Latitude,DateTime)
 mygarmin$tn - as.numeric(mygarmin$DateTime)

 # use 'findInterval'
 na.indx - which(is.na(myvscan$Latitude))  # find NAs
 # replace with garmin latitude
 myvscan$Latitude[na.indx] -
mygarmin$Latitude[findInterval(myvscan$tn[na.indx], mygarmin$tn)]


 myvscan
  LatitudeDateTime tn
1  1.0 2009-05-22 12:00:00 1243008000
2 30.0 2009-05-22 12:14:00 1243008840
3  1.5 2009-05-22 12:20:00 1243009200



On Fri, May 22, 2009 at 12:45 AM, Tim Clark mudiver1...@yahoo.com wrote:


 Dear List,

 I need some help in coming up with a function that will take two data sets,
 determine if a value is missing in one, find a value in the second that was
 taken at about the same time, and substitute the second value in for where
 the first should have been.  My problem is from a fish tracking study.  We
 put acoustic tags in fish and track them for several days.  Location data is
 supposed to be automatically recorded every time we detect a ping from the
 fish.  Unfortunately the GPS had some problems and sometimes the fishes
 depth was recorded but not its location.  I fortunately had a back-up GPS
 that was taking location data every five minutes.  I would like to merge the
 two files, replacing the missing value in the vscan (automatic) file with
 the location from the garmin file.  Since we were getting vscan records
 every 1-2 seconds and garmin records every 5 minutes, I need to find the
 right place in the vscan file to place the garmin record - i.e. the
  closest in time, but not greater than 5 minutes.  I have written a
 function that does this. However, it works with my test data but locks up my
 computer with my real data.  I have several million vscan records and
 several thousand garmin records.  Is there a better way to do this?


 My function and test data:

 myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,12:20:00)))
 names(myvscan)-c(Latitude,DateTime)
 mygarmin-data.frame(c(20,30,40),times((12:00:00,12:10:00,12:15:00)))
 names(mygarmin)-c(Latitude,DateTime)

 minute.diff-1/24/12   #Time diff is in days, so this is 5 minutes
 for (k in 1:nrow(myvscan))
 {
 if (is.na(myvscan$Latitude[k]))
 {
 if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k])))  minute.diff )
 {
 index.min.date-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k]))
 myvscan$Latitude[k]-mygarmin$Latitude[index.min.date]
 }}}

 I appreciate your help and advice.

 Aloha,

 Tim




 Tim Clark
 Department of Zoology
 University of Hawaii

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.htmlhttp://www.r-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Need a faster function to replace missing data

2009-05-22 Thread Tim Clark

Jim,

Thanks!  I like the way you use indexing instead of the loops.  However, the 
find.Interval function does not give the right result.  I have been playing 
with it and it seems to give the closest number that is less than the one of 
interest.  In this case, the correct replacement should have been 40, not 30, 
since 12:15 from mygarmin is closer to 12:14 in myvscan than 12:10.  Is there a 
way to get the function to find the closest in value instead of the next 
smaller value?  I was trying to use which.min to get the closet date but can't 
seem to get it to work right either.

Aloha,

Tim


Tim Clark
Department of Zoology 
University of Hawaii


--- On Fri, 5/22/09, jim holtman jholt...@gmail.com wrote:

 From: jim holtman jholt...@gmail.com
 Subject: Re: [R] Need a faster function to replace missing data
 To: Tim Clark mudiver1...@yahoo.com
 Cc: r-help@r-project.org
 Date: Friday, May 22, 2009, 7:24 AM
 I think this does what you
 want.  It uses 'findInterval' to determine where a
 possible match is:
  
 
 myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00),
 format=%H:%M:%S))
  # convert to numeric
 
 names(myvscan)-c(Latitude,DateTime)
 
  myvscan$tn - as.numeric(myvscan$DateTime)  #
 numeric for findInterval
 
 mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00),
 format=%H:%M:%S))
 
 
 names(mygarmin)-c(Latitude,DateTime)
  mygarmin$tn - as.numeric(mygarmin$DateTime)
  
  # use 'findInterval'
  na.indx - which(is.na(myvscan$Latitude))  # find
 NAs
 
  # replace with garmin latitude
  myvscan$Latitude[na.indx] -
 mygarmin$Latitude[findInterval(myvscan$tn[na.indx],
 mygarmin$tn)]
  
  
  myvscan
   Latitude    DateTime
 tn
 
 1  1.0 2009-05-22 12:00:00 1243008000
 2 30.0 2009-05-22 12:14:00 1243008840
 3  1.5 2009-05-22 12:20:00 1243009200
  
 
 
 
 On Fri, May 22, 2009 at 12:45 AM,
 Tim Clark mudiver1...@yahoo.com
 wrote:
 
 
 Dear List,
 
 I need some help in coming up with a function that will
 take two data sets, determine if a value is missing in one,
 find a value in the second that was taken at about the same
 time, and substitute the second value in for where the first
 should have been.  My problem is from a fish tracking
 study.  We put acoustic tags in fish and track them for
 several days.  Location data is supposed to be
 automatically recorded every time we detect a
 ping from the fish.  Unfortunately the GPS had
 some problems and sometimes the fishes depth was recorded
 but not its location.  I fortunately had a back-up GPS that
 was taking location data every five minutes.  I would like
 to merge the two files, replacing the missing value in the
 vscan (automatic) file with the location from the garmin
 file.  Since we were getting vscan records every 1-2
 seconds and garmin records every 5 minutes, I need to find
 the right place in the vscan file to place the garmin record
 - i.e. the
 
  closest in time, but not greater than 5 minutes.  I have
 written a function that does this. However, it works with my
 test data but locks up my computer with my real data.  I
 have several million vscan records and several thousand
 garmin records.  Is there a better way to do this?
 
 
 
 My function and test data:
 
 myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,12:20:00)))
 names(myvscan)-c(Latitude,DateTime)
 
 mygarmin-data.frame(c(20,30,40),times((12:00:00,12:10:00,12:15:00)))
 names(mygarmin)-c(Latitude,DateTime)
 
 minute.diff-1/24/12   #Time diff is in days, so this
 is 5 minutes
 
 for (k in 1:nrow(myvscan))
 {
 if (is.na(myvscan$Latitude[k]))
 {
 if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) 
 minute.diff )
 {
 index.min.date-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k]))
 
 myvscan$Latitude[k]-mygarmin$Latitude[index.min.date]
 }}}
 
 I appreciate your help and advice.
 
 Aloha,
 
 Tim
 
 
 
 
 Tim Clark
 Department of Zoology
 University of Hawaii
 
 __
 
 R-help@r-project.org
 mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 
 and provide commented, minimal, self-contained,
 reproducible code.
 
 
 
 
 -- 
 Jim Holtman
 Cincinnati, OH
 +1 513 646 9390
 
 What is the problem that you are trying to solve?
 
 




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Need a faster function to replace missing data

2009-05-22 Thread jim holtman
Here is a modification that should now find the closest:


myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00),
+ format=%H:%M:%S))
 # convert to numeric

 names(myvscan)-c(Latitude,DateTime)

 myvscan$tn - as.numeric(myvscan$DateTime)  # numeric for findInterval


mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00),
+ format=%H:%M:%S))


 names(mygarmin)-c(Latitude,DateTime)
 mygarmin$tn - as.numeric(mygarmin$DateTime)

 # use 'findInterval'
 na.indx - which(is.na(myvscan$Latitude))  # find NAs

 # create matrix of values to test the range
 indices - findInterval(myvscan$tn[na.indx],mygarmin$tn)
 x - cbind(indices,
+abs(myvscan$tn[na.indx] - mygarmin$tn[indices]), # lower
+abs(myvscan$tn[na.indx] - mygarmin$tn[indices + 1]))  #higher
 # now determine which index is closer
 closest - x[,1] + (x[,2]  x[,3])  # determine the proper index
 # replace with garmin latitude
 myvscan$Latitude[na.indx] - mygarmin$Latitude[closest]



 myvscan
  LatitudeDateTime tn
1  1.0 2009-05-23 12:00:00 124308
2 40.0 2009-05-23 12:14:00 1243080840
3  1.5 2009-05-23 12:20:00 1243081200



On Fri, May 22, 2009 at 7:39 PM, Tim Clark mudiver1...@yahoo.com wrote:


 Jim,

 Thanks!  I like the way you use indexing instead of the loops.  However,
 the find.Interval function does not give the right result.  I have been
 playing with it and it seems to give the closest number that is less than
 the one of interest.  In this case, the correct replacement should have been
 40, not 30, since 12:15 from mygarmin is closer to 12:14 in myvscan than
 12:10.  Is there a way to get the function to find the closest in value
 instead of the next smaller value?  I was trying to use which.min to get the
 closet date but can't seem to get it to work right either.

 Aloha,

 Tim


 Tim Clark
 Department of Zoology
 University of Hawaii


 --- On Fri, 5/22/09, jim holtman jholt...@gmail.com wrote:

  From: jim holtman jholt...@gmail.com
  Subject: Re: [R] Need a faster function to replace missing data
  To: Tim Clark mudiver1...@yahoo.com
  Cc: r-help@r-project.org
  Date: Friday, May 22, 2009, 7:24 AM
   I think this does what you
  want.  It uses 'findInterval' to determine where a
  possible match is:
 
  
 
 myvscan-data.frame(c(1,NA,1.5),as.POSIXct(c(12:00:00,12:14:00,12:20:00),
  format=%H:%M:%S))
   # convert to numeric
  
  names(myvscan)-c(Latitude,DateTime)
 
   myvscan$tn - as.numeric(myvscan$DateTime)  #
  numeric for findInterval
  
 
 mygarmin-data.frame(c(20,30,40),as.POSIXct(c(12:00:00,12:10:00,12:15:00),
  format=%H:%M:%S))
 
  
  names(mygarmin)-c(Latitude,DateTime)
   mygarmin$tn - as.numeric(mygarmin$DateTime)
  
   # use 'findInterval'
   na.indx - which(is.na(myvscan$Latitude))  # find
  NAs
 
   # replace with garmin latitude
   myvscan$Latitude[na.indx] -
  mygarmin$Latitude[findInterval(myvscan$tn[na.indx],
  mygarmin$tn)]
  
  
   myvscan
LatitudeDateTime
  tn
 
  1  1.0 2009-05-22 12:00:00 1243008000
  2 30.0 2009-05-22 12:14:00 1243008840
  3  1.5 2009-05-22 12:20:00 1243009200
  
 
 
 
  On Fri, May 22, 2009 at 12:45 AM,
  Tim Clark mudiver1...@yahoo.com
  wrote:
 
 
  Dear List,
 
  I need some help in coming up with a function that will
  take two data sets, determine if a value is missing in one,
  find a value in the second that was taken at about the same
  time, and substitute the second value in for where the first
  should have been.  My problem is from a fish tracking
  study.  We put acoustic tags in fish and track them for
  several days.  Location data is supposed to be
  automatically recorded every time we detect a
  ping from the fish.  Unfortunately the GPS had
  some problems and sometimes the fishes depth was recorded
  but not its location.  I fortunately had a back-up GPS that
  was taking location data every five minutes.  I would like
  to merge the two files, replacing the missing value in the
  vscan (automatic) file with the location from the garmin
  file.  Since we were getting vscan records every 1-2
  seconds and garmin records every 5 minutes, I need to find
  the right place in the vscan file to place the garmin record
  - i.e. the
 
   closest in time, but not greater than 5 minutes.  I have
  written a function that does this. However, it works with my
  test data but locks up my computer with my real data.  I
  have several million vscan records and several thousand
  garmin records.  Is there a better way to do this?
 
 
 
  My function and test data:
 
 
 myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,12:20:00)))
  names(myvscan)-c(Latitude,DateTime)
 
 
 mygarmin-data.frame(c(20,30,40),times((12:00:00,12:10:00,12:15:00)))
  names(mygarmin)-c(Latitude,DateTime)
 
  minute.diff-1/24/12   #Time diff is in days, so this
  is 5 minutes
 
  for (k in 1:nrow(myvscan))
  {
  if (is.na(myvscan$Latitude[k]))
  {
  if ((min(abs(mygarmin$DateTime-myvscan

[R] Need a faster function to replace missing data

2009-05-21 Thread Tim Clark

Dear List,

I need some help in coming up with a function that will take two data sets, 
determine if a value is missing in one, find a value in the second that was 
taken at about the same time, and substitute the second value in for where the 
first should have been.  My problem is from a fish tracking study.  We put 
acoustic tags in fish and track them for several days.  Location data is 
supposed to be automatically recorded every time we detect a ping from the 
fish.  Unfortunately the GPS had some problems and sometimes the fishes depth 
was recorded but not its location.  I fortunately had a back-up GPS that was 
taking location data every five minutes.  I would like to merge the two files, 
replacing the missing value in the vscan (automatic) file with the location 
from the garmin file.  Since we were getting vscan records every 1-2 seconds 
and garmin records every 5 minutes, I need to find the right place in the vscan 
file to place the garmin record - i.e. the
 closest in time, but not greater than 5 minutes.  I have written a function 
that does this. However, it works with my test data but locks up my computer 
with my real data.  I have several million vscan records and several thousand 
garmin records.  Is there a better way to do this?


My function and test data:

myvscan-data.frame(c(1,NA,1.5),times(c(12:00:00,12:14:00,12:20:00)))
names(myvscan)-c(Latitude,DateTime)
mygarmin-data.frame(c(20,30,40),times((12:00:00,12:10:00,12:15:00)))
names(mygarmin)-c(Latitude,DateTime)

minute.diff-1/24/12   #Time diff is in days, so this is 5 minutes
for (k in 1:nrow(myvscan))  
{
if (is.na(myvscan$Latitude[k]))
{
if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k])))  minute.diff )
{
index.min.date-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k]))
myvscan$Latitude[k]-mygarmin$Latitude[index.min.date] 
}}}

I appreciate your help and advice.

Aloha,

Tim




Tim Clark
Department of Zoology 
University of Hawaii

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.