Here is what I did: convert the data to a numeric matrix for faster processing. You can convert back to a data frame afterwards, since the matrix keeps the indices into the factor levels for the flights and runways.
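A minimal sketch of that round trip, using made-up toy data rather than the flight data. The names `toy`, `toy.mat`, `flight.levels` and `back` are hypothetical and exist only for this illustration; the point is just that as.numeric() on a factor yields indices into levels(), so the labels can be recovered later.

```r
# Toy data frame with a factor column (hypothetical values, illustration only)
toy <- data.frame(
    flightfact = factor(c("AAL12", "UAL7", "AAL12", "DAL300")),
    dt = c(1.5, -0.2, 0.7, 3.1)
)

# as.numeric() on a factor returns its integer codes, i.e. indices into levels()
toy.mat <- cbind(flight = as.numeric(toy$flightfact), dt = toy$dt)

# Keep the levels so the codes stored in the matrix can be mapped back
flight.levels <- levels(toy$flightfact)

# Convert back: index the saved levels with the codes from the matrix
back <- data.frame(
    flightfact = factor(flight.levels[toy.mat[, "flight"]], levels = flight.levels),
    dt = toy.mat[, "dt"]
)

# Check that the labels survived the round trip
identical(as.character(back$flightfact), as.character(toy$flightfact))
```

One caveat: this mapping is specific to factor columns. If a factor holds numbers, as.numeric() returns the codes rather than the printed values, so it is worth checking how each column is stored before converting.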
> # read in data
> source('/temp/df/df')
> # convert to matrix
> df.mat <- cbind(pt = as.numeric(df$PredTime)
+     , dt = as.numeric(df$dt)
+     , rw = as.numeric(df$lrw)  # index into 'levels'
+     , flight = as.numeric(df$flightfact)
+     )
> # create a list of row numbers for each flight for processing
> flgt.list <- split(seq(nrow(df.mat)), df.mat[, 'flight'])
> # remove lists with only 1 entry
> flgt.list <- flgt.list[sapply(flgt.list, length) > 1]
>
> # create the interval we want data for
> interval <- as.numeric(0:60)
>
> # now process the flights
> times <- lapply(flgt.list, function(.flt){
+     interp <- approx(df.mat[.flt, 'pt']
+         , df.mat[.flt, 'dt']
+         , xout = interval
+         , rule = 1
+         )
+     # return a matrix with one row per minute
+     cbind(time = interp$x
+         , error = interp$y
+         , runway = df.mat[.flt[1L], 'rw']
+         , flight = df.mat[.flt[1L], 'flight']
+         )
+ })
> # sample output -- is this correct?
> times[[1]]
      time       error runway flight
 [1,]    0          NA      2      1
 [2,]    1          NA      2      1
 [3,]    2 -0.13795380      2      1
 [4,]    3 -0.20726073      2      1
 [5,]    4 -0.27309237      2      1
 [6,]    5 -0.33333333      2      1
 [7,]    6 -0.09322419      2      1
 [8,]    7  0.14688495      2      1
 [9,]    8  0.38699409      2      1
[10,]    9  0.62710323      2      1
[11,]   10  0.86721237      2      1
[12,]   11  1.10732151      2      1
[13,]   12  1.34743065      2      1
[14,]   13  1.58753979      2      1
[15,]   14  1.82764893      2      1
[16,]   15  2.06775807      2      1
[17,]   16  2.30786721      2      1
[18,]   17  2.54797635      2      1
[19,]   18  6.66600000      2      1
[20,]   19  4.82600000      2      1
[21,]   20  3.00436508      2      1
[22,]   21  2.22316562      2      1
[23,]   22  1.34895178      2      1
[24,]   23  0.47473795      2      1
[25,]   24 -0.39947589      2      1
[26,]   25 -1.27368973      2      1
[27,]   26 -2.12478632      2      1
[28,]   27 -1.61196581      2      1
[29,]   28 -1.09914530      2      1
[30,]   29 -0.58632479      2      1
[31,]   30 -0.07350427      2      1
[32,]   31  0.43931624      2      1
[33,]   32  0.95213675      2      1
[34,]   33  1.46495726      2      1
[35,]   34  1.97777778      2      1
[36,]   35  2.49059829      2      1
[37,]   36  3.00341880      2      1
[38,]   37  3.51623932      2      1
[39,]   38  4.02905983      2      1
[40,]   39  4.54188034      2      1
[41,]   40  5.05470085      2      1
[42,]   41  5.53360434      2      1
[43,]   42  5.53766938      2      1
[44,]   43  5.54173442      2      1
[45,]   44  5.54579946      2      1
[46,]   45  5.54986450      2      1
[47,]   46  5.55392954      2      1
[48,]   47  5.55799458      2      1
[49,]   48  5.56205962      2      1
[50,]   49  5.56612466      2      1
[51,]   50  5.57018970      2      1
[52,]   51  5.57425474      2      1
[53,]   52  5.57831978      2      1
[54,]   53  5.58238482      2      1
[55,]   54  5.58644986      2      1
[56,]   55  5.59051491      2      1
[57,]   56  5.59457995      2      1
[58,]   57  5.59864499      2      1
[59,]   58  5.60271003      2      1
[60,]   59  5.60677507      2      1
[61,]   60  5.61084011      2      1

On Sun, Jul 17, 2011 at 6:58 PM, James Rome <jamesr...@gmail.com> wrote:
> I thought I had included the data... Here it is again.
>
> What I want to do is to make box and whisker plots with each flight
> counted the same number of times in each time bin. Hence the
> interpolation to minute time hacks.
>
>
> On 7/17/2011 4:16 PM, jim holtman wrote:
>> It would be nice if you had some sample data included so that we could
>> see how the code worked. Have you used Rprof on the code to see where
>> you are spending your time? You might want to use 'matrix' instead of
>> 'data.frames' since there is a big performance impact with dataframes
>> when indexing. A little more description of the problem you are
>> trying to solve would also be useful. I tend to ask people "tell me
>> what you want to do, not how you want to do it".
>>
>> On Sun, Jul 17, 2011 at 1:30 PM, James Rome <jamesr...@gmail.com> wrote:
>>> df is a very large data frame with arrival estimates for many flights
>>> (df$flightfact) at random times (df$PredTime). The error of the estimate
>>> is df$dt.
>>> My problem is that I want to know the prediction error at each minute
>>> before landing. This code works, but is very slow, and dominates
>>> everything. I tried using split(), but that rapidly ate up my 12 GB of
>>> memory. So, is there a better R way of doing this?
>>>
>>> Thanks,
>>> Jim Rome
>>>
>>> flights = table(df$flightfact[1:dim(df)[1], drop=TRUE])
>>> nflights = length(flights)
>>> flights = as.data.frame(flights)
>>> times = data.frame()
>>> # Split by flight
>>> for(i in 1:nflights) {
>>>     tf = df[as.numeric(df$flightfact)==flights[i,1],] # This flight
>>>     #check for at least 2 entries
>>>     if(dim(tf)[1] < 2) {
>>>         next
>>>     }
>>>     idf = interpolateTimes(tf)
>>>     times = rbind(times, idf)
>>> }
>>>
>>> # Interpolate the times to every minute for 60 minutes
>>> # Return a new data frame
>>> interpolateTimes = function(df) {
>>>     x = as.numeric(seq(from=0,to=60)) # The times to interpolate to
>>>     dti = approx(as.numeric(df$PredTime), as.numeric(df$dt), x,
>>>         method="linear",rule=1:1)
>>>     # Make a new data frame of interpolated values
>>>     idf = data.frame(time=dti$x, error=dti$y,
>>>         runway=rep(df$lrw[1],length(dti$x)),
>>>         flight=rep(df$flightfact[1], length(dti$x)))
>>>     return(idf)
>>> }
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?