I'm having a problem where I have to apply a function to a subset of a variable, where the subset is defined by the n nearest neighbours of a second variable.
Here's an example applied to the 'iris' dataset:

$ head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

For each row, I look at the value of Sepal.Length. I then find the n rows where the value of Sepal.Length is closest to that in the original row, and apply a function to the values of Sepal.Width in these rows (typically returning a scalar).

For example, setting n = 5 and calculating the mean on a slightly modified dataset, based on the first row (Sepal.Length ~= 5.1):

$ set.seed(1)
$ iris[,1:4]=iris[,1:4]+runif(150)/100
$ x=iris$Sepal.Length[1]
$ (pos=which(order(abs(iris$Sepal.Length-x)) %in% 2:6))
[1] 18 26 40 42 52
$ mean(iris$Sepal.Width[pos])
[1] 3.086595

Now, I could easily use a 'for' loop or 'sapply' to do this for all rows, but I would think there is a better (and perhaps even faster?) way. Does anyone know of a specific function in a package for this sort of thing?

Also note that this way of doing it won't necessarily work on the unmodified dataset, where a number of rows have the same value for 'Sepal.Length', and the original row won't necessarily have an 'order' value equal to 1.

(Exactly how to break ties when more than n observations have the same distance to the original row isn't very important, though. For example, using the ones with the lowest row numbers, or picking n random ones, would both be OK.)

-- 
Karl Ove Hufthammer

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
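For reference, the per-row procedure described above could be sketched roughly as follows. This is only a minimal sketch, not a package function: the name nn_apply and its arguments are hypothetical. It assumes ties may be broken by lowest row number, which is what order() does by leaving tied values in their original order. Two details differ from the one-row example in the post. First, the row itself is excluded explicitly by setting its distance to Inf, so the approach does not depend on the row having 'order' value 1 and should also work on the unmodified data. Second, the neighbours are taken as order(d)[1:n], which yields row indices; which(order(d) %in% 2:6) instead returns the positions within the ordering at which rows 2 to 6 appear.

```r
# Hypothetical helper (not from any package): for each row i, apply FUN to
# the y values of the n rows whose x values are closest to x[i],
# excluding row i itself.
nn_apply <- function(x, y, n, FUN = mean) {
  stopifnot(length(x) == length(y), n < length(x))
  vapply(seq_along(x), function(i) {
    d <- abs(x - x[i])           # distance from row i to every row
    d[i] <- Inf                  # exclude row i explicitly
    nb <- order(d)[seq_len(n)]   # row indices of the n nearest rows;
                                 # ties are kept in row order
    FUN(y[nb])                   # FUN is assumed to return one number
  }, numeric(1))
}

# Usage on the built-in iris data: mean Sepal.Width of the 5 nearest
# neighbours in Sepal.Length, for every row.
res <- nn_apply(iris$Sepal.Length, iris$Sepal.Width, n = 5)
head(res)
```

This is still a loop over rows internally, so it addresses the tie and self-exclusion issues rather than speed; each row costs one full sort of the distances.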