Dear Forrest,
Thanks a lot for your tip. I think quantregForest is what we were
looking for. It takes much more time to compute, but the method looks
sound
(http://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf). In
the end I do simplify things: I assume that the quantiles at 0.15866 and
1-0.15866 give the lower and upper limits for +/- 1 s.d. and then derive
the prediction variance from these, but this is probably as good as it
gets. Here is the revised code:
https://code.google.com/p/gsif/source/browse/trunk/meuse/RK_vs_RandomForestK.R
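The +/- 1 s.d. shortcut described above can be sketched in base R. This is only an illustration, not the code from the linked script: the simulated draws, their mean and s.d., and the names draws, q, sd.q and pred.var are all assumptions made up for the example. Half the distance between the two quantiles matches the s.d. only when the distribution is close to normal, so for skewed predictions it is an approximation at best.

```r
## Sketch only: simulated draws stand in for a predictive distribution
set.seed(1)
draws <- rnorm(1e5, mean = 2.3, sd = 0.4)   # hypothetical values

## Probabilities that bracket +/- 1 s.d. under a normal distribution
p <- c(pnorm(-1), pnorm(1))                 # about 0.15866 and 1 - 0.15866

## Lower and upper limits, then half their distance as an s.d. estimate
q <- quantile(draws, probs = p, names = FALSE)
sd.q <- (q[2] - q[1]) / 2
pred.var <- sd.q^2                          # used as the prediction variance
```

With quantregForest the two limits would come from the predicted quantiles for each pixel instead of from quantile() on simulated draws.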
Thank you all for your suggestions / opinions (very useful as usual).
cheers,
T. (Tom) Hengl
Url: http://www.wageningenur.nl/en/Persons/dr.-T-Tom-Hengl.htm
Network: http://profiles.google.com/tom.hengl
Publications: http://scholar.google.com/citations?user=2oYU7S8AAAAJ
On 23/06/2013 15:08, Forrest Stevens wrote:
Hi Tom, I've done something similar in the past to visualize the
distribution of the predictions attained for each observation across
the many trees within a random forest while looking at various aspects
of those ranges and correlating that with cross-validated prediction
errors. It's relatively easy to generate and keep the predictions for
every tree for each observation (pixel in your case) using the
predict.all=TRUE argument:
predictions <- predict(random_forest, newdata=x_data_new, predict.all=TRUE)
Then to extract all of the individual trees' predictions for the first
observation:
predictions$individual[1, ]
You can do this to get the mean and SD for each observation (note the
mean should match the value in predictions$aggregate):
y_data$rf_mean <- apply(predictions$individual, MARGIN=1, mean)
y_data$rf_sd <- apply(predictions$individual, MARGIN=1, sd)
y_data$rf_cv <- apply(predictions$individual, MARGIN=1, function(p) sd(p) / mean(p))
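A minimal base-R sketch of the same per-observation summaries, runnable without a fitted forest. The matrix individual is a simulated stand-in for predictions$individual (rows are observations, columns are trees); its size, the log-normal values and the 5/50/95% quantiles are assumptions chosen only for illustration.

```r
## Sketch only: 'individual' is a simulated stand-in for predictions$individual
set.seed(42)
n.obs <- 5; n.tree <- 500                   # hypothetical sizes
individual <- matrix(rlnorm(n.obs * n.tree, meanlog = 1, sdlog = 0.5),
                     nrow = n.obs, ncol = n.tree)

first.obs <- individual[1, ]                # all trees, first observation
rf_mean <- rowMeans(individual)             # per-observation mean over trees
rf_sd   <- apply(individual, MARGIN = 1, sd)
rf_cv   <- rf_sd / rf_mean                  # coefficient of variation
## Quantiles cope better with skewed per-tree distributions than mean/SD
rf_q    <- t(apply(individual, MARGIN = 1, quantile,
                   probs = c(0.05, 0.5, 0.95)))
```

Note the comma in individual[1, ]: without it, R returns only the first element of the matrix rather than the whole row.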
In practice I've found during testing that the distribution of values
(assuming the continuous regression case, since you're looking at SD in
the first place) is highly skewed. The range, SD, CV and other
measures of the spread of the individual trees' predictions do not
correlate well at all with prediction errors in my work. It kind of makes
intuitive sense, since the power of the random forest algorithm lies
in the ensemble nature of the technique and the randomness injected
via variable sampling at each node, and those measures of variation in
the predictions I've looked at quickly become irrelevant as you scale
up the number of trees in the forest. So your mileage may vary, but
I'd be interested to know what you find.
You may also want to look at the excellent quantregForest package as
it produces a randomForest object but also produces information on the
quantiles and quantile range for each observation's prediction for
you, including some nice plots that I've found useful.
Sincerely,
Forrest
On Sun, Jun 23, 2013 at 5:51 AM, Tomislav Hengl
<[email protected]> wrote:
Dear list,
I have a question about the randomForest models. I'm trying to figure out a
way to estimate the prediction variance (spatially) for the randomForest
function (http://cran.r-project.org/web/packages/randomForest/).
If I run a GLM I can also derive the prediction variance using:
library(sp)
demo(meuse, echo=FALSE)
meuse.ov <- over(meuse, meuse.grid)
meuse.ov <- cbind(meuse.ov, meuse@data)
omm0 <- glm(log1p(om)~dist+ffreq, meuse.ov, family=gaussian())
om.glm <- predict.glm(omm0, meuse.grid, se.fit=TRUE)
str(om.glm)
List of 3
$ fit : Named num [1:3103] 2.34 2.34 2.32 2.29 2.34 ...
..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
$ se.fit : Named num [1:3103] 0.0491 0.0491 0.0481 0.046 0.0491 ...
..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
$ residual.scale: num 0.357
When I fit a randomForest model, I do not get any estimate of the model
uncertainty (for each pixel) but just the predictions:
library(randomForest)
meuse.ov <- meuse.ov[-omm0$na.action,]
x <- randomForest(log1p(om)~dist+ffreq, meuse.ov)
om.rf <- predict(x, meuse.grid)
str(om.rf)
Named num [1:3103] 2.49 2.49 2.51 2.44 2.49 ...
- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
Does anyone have an idea how to map the prediction variance (i.e. estimated
or propagated error) for the randomForest models spatially?
I've tried deriving a propagated error for the randomForest models (every
fit gives a different model due to the random component):
l.rfk <- data.frame(om_1 = rep(NA, nrow(meuse.grid)))
for(i in 1:50){
  suppressWarnings(suppressMessages(x <- randomForest(log1p(om)~dist+ffreq, meuse.ov)))
  l.rfk[,paste("om",i,sep="_")] <- predict(x, meuse.grid)
} ## takes ca 1 minute
## om.rfk is not created in this snippet; it is defined in the complete code linked below
meuse.grid$om.rfkvar <- om.rfk@predicted$var1.var + apply(l.rfk, 1, var)
but the prediction variance I get is rather small (much smaller than e.g.
the GLM variance). Here is the complete code with some plots:
R code:
https://code.google.com/p/gsif/source/browse/trunk/meuse/RK_vs_RandomForestK.R
Predictions UK vs randomForest-kriging:
https://gsif.googlecode.com/svn/trunk/meuse/Fig_meuse_RK_vs_RFK.png
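One possible reason the refit variance comes out so small: refitting on the same data only captures the algorithm's internal randomness, and averaging over many trees damps that quickly. The base-R sketch below illustrates the effect under a strong simplifying assumption, namely that each tree is an independent noisy draw around a fixed value (real trees are correlated). The function refit.var and the values tree.sd, 2.4 and n.refits are made up for the illustration.

```r
## Sketch only: each "tree" is an independent noisy draw around a fixed value,
## a simplifying assumption (real trees are correlated)
set.seed(7)
tree.sd <- 0.5                              # hypothetical spread between trees

refit.var <- function(ntree, n.refits = 200) {
  ## one forest prediction = mean of ntree tree predictions
  forest.pred <- replicate(n.refits,
                           mean(rnorm(ntree, mean = 2.4, sd = tree.sd)))
  var(forest.pred)                          # variance across refits
}

v <- sapply(c(10, 100, 1000), refit.var)
names(v) <- c("ntree=10", "ntree=100", "ntree=1000")
print(signif(v, 2))                         # roughly tree.sd^2 / ntree
```

Under this assumption the variance across refits shrinks in proportion to 1/ntree, so with a few hundred trees it says little about the uncertainty that comes from the data themselves.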
thanx,
T. (Tom) Hengl
_______________________________________________
R-sig-Geo mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo