On Tue, 2011-01-11 at 20:02 -0800, Mike Marsh wrote:
> All,
> I am trying to replicate figure 2 in Glenn De'Ath's 2002 report, Ecology
> 83:1105-1117.
> The figure caption says that data were species standardized and site
> standardized to the same mean, and that euclidean distance was used in
> the analysis.
> Here is my script, using the "spider" data in the mvpart package:
>
> I tried two different standardizing functions, which do not produce the
> same scaling result:
I see what is happening. Glenn isn't clear about what he means by
standardising to the same mean. I initially took that to mean centring to
mean 0. What he appears to mean is standardising to a mean of 1. The
reason your first version *didn't* work is that you did effectively this:

    ## MVpart Q
    require(mvpart)
    data(spider)
    spp <- data.matrix(spider[, 1:12])
    env <- spider[, -c(1:12)]
    mvpart(spp ~ ., data = env, xv = "pick")
    set.seed(1)
    colMu <- colMeans(spp)
    spp2 <- sweep(spp, 2, colMu, "-")
    rowMu <- rowMeans(spp2)
    spp2 <- sweep(spp2, 1, rowMu, "-")
    mvpart(spp2 ~ ., data = env)

Note the "-" in my sweep() calls. This is the default, so you didn't
include it. This gives mean-centred data with colMeans() and rowMeans()
of (effectively; this is a computer, remember) zero:

    > head(rowMeans(spp2))
    [1]  0.000000e+00  3.700743e-17 -2.312965e-17  9.251859e-18  7.401487e-17
    [6]  0.000000e+00
    > head(colMeans(spp2))
        arct.lute     pard.lugu     zora.spin     pard.nigr     pard.pull
    -6.938894e-18 -7.930164e-18 -2.379049e-17  5.179389e-17  1.982541e-17
        aulo.albi
    -5.055480e-17

What Glenn's scaler() function does is effectively this (for "mean1"):

    set.seed(1)
    colMu <- colMeans(spp)
    spp3 <- sweep(spp, 2, colMu, "/")
    rowMu <- rowMeans(spp3)
    spp3 <- sweep(spp3, 1, rowMu, "/")
    mvpart(spp3 ~ ., data = env)

And we note that this does give the same tree; well, similar, as this one
has an extra branch pruned off, but that just reflects the instability of
the CV results for such a small data set.

So why did you get a different result with scaler()? You just weren't
careful enough to apply scaler() to the *species* data only. You did:

    spider.std <- scaler(spider, col = "mean1", row = "mean1")

and spider contains both the env and species data, so the row means used
to centre the data include values of the env data, and hence the
difference. This is not what you want.
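To see the difference between the two sweep() versions concretely, here is a
small self-contained check in base R on a made-up matrix (the values are
arbitrary, purely for illustration):

```r
## Toy abundance-like matrix; values are arbitrary, for illustration only
set.seed(1)
m <- matrix(rpois(20, lambda = 5) + 1, nrow = 4)

## "-" version: centre columns, then rows, to mean 0
m0 <- sweep(m, 2, colMeans(m), "-")
m0 <- sweep(m0, 1, rowMeans(m0), "-")

## "/" version: standardise columns, then rows, to mean 1 ("mean1" style)
m1 <- sweep(m, 2, colMeans(m), "/")
m1 <- sweep(m1, 1, rowMeans(m1), "/")

## After the second sweep, row means are (up to floating point) 0 and 1
range(rowMeans(m0))
range(rowMeans(m1))
```

Note that only the *row* means come out exactly 0 or 1 after the second
sweep; the column means drift slightly from their target, which is why the
two procedures are genuinely different standardisations, not rescalings of
one another.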
This is probably what was intended (note mvpart() wants a matrix response,
hence the data.matrix() wrapper):

    spider.std <- scaler(spider[, 1:12], col = "mean1", row = "mean1")
    spider.mrt <- mvpart(data.matrix(spider.std) ~ herbs + reft + moss +
                         sand + twigs + water, data = spider)

> so I have not been able to reproduce De'Ath's result, and
> there are at least 3 different configurations of the regression tree for
> spider data, depending on how it is calculated.

I think Glenn's results are reproducible *if* you run multiple CVs (i.e.
run mvpart many times, aggregate the sizes of the selected trees, and see
which size is selected most often). But even then, the deviations are only
in whether lower branches are pruned or retained, *if* you apply the
standardisation used by Glenn (but not clearly stated).

> So, can I obtain consistent results using mvpart with my own data, in
> order to compare them to the clusters that I obtained with hclust?

Yes, *if* (and this is a big IF) you have a large data set, or a data set
with a sufficiently large signal-to-noise ratio to cancel out the noise
introduced by CV on a smaller data set. Trees really need 100s to 1000s of
observations to work properly, and even then you'd need a good
signal-to-noise ratio.

There is a problem with trees, and MRT does nothing to get round this;
namely one of stability. A slight change in the input data /can/ have
large implications for the tree grown. Note this is a different issue from
the one you report, which is reproducing someone else's analysis. The
point is whether the results are "consistent". We know trees are unstable
(have high variance). If you want to account for the effects of sampling
and reduce the variance of the fitted model, then bagging, random forests,
or even boosting are ways to go. Boosting is likely to do better than
either bagging or random forests because it tries to reduce the bias of
the model as well as the variance. The problem here, though, is that none
of the bagging, random forest, or boosting code I am aware of in R will
work with multivariate trees outright.
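MRT aside, the instability point is easy to demonstrate with plain rpart()
(on which Glenn's mvpart is based). A hedged sketch, using the univariate
kyphosis data that ships with rpart purely for illustration; the same idea
carries over to multivariate trees:

```r
## Sketch: how much does the fitted tree change under resampling?
## Plain rpart() and its bundled kyphosis data, for illustration only.
library(rpart)
data(kyphosis)

set.seed(42)
B <- 50
root_split <- character(B)
for (b in seq_len(B)) {
    ## bootstrap resample of the data
    boot <- kyphosis[sample(nrow(kyphosis), replace = TRUE), ]
    fit <- rpart(Kyphosis ~ Age + Number + Start, data = boot)
    ## fit$frame$var[1] is the variable split on at the root
    ## ("<leaf>" if the tree was not grown at all)
    root_split[b] <- as.character(fit$frame$var[1])
}

## If trees were stable, a single variable would dominate this table
table(root_split)
```

If even the root split moves around under resampling, everything below it
does too, which is exactly the high-variance behaviour that bagging and
friends are designed to average away.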
You *might* be able to use code in the party package, but it isn't fitting
the same type of trees. You *might* be able to use the ipred package and
the bagging() function contained therein, because it calls rpart() and
Glenn's package provides a modified version of rpart. But you will have to
make sure ipred can deal with multivariate responses, and check whether
you can pass all the required arguments through to mvpart. You may not
need the latter, but I would urge extreme caution in trying it, and make
sure you check it is doing the right thing; ipred knows nothing about
Glenn's modified rpart, so it is not guaranteed to work.

You could always do this by hand, but that will involve some coding effort
on your part; bootstrapping is easy, it is storing the results and then
aggregating over the ensemble of trees that is tricky, plus the
interpretation of the model afterwards, which will require variable
importance measures and partial plots. That is where things get more
involved...

HTH

G

> with thanks,
> Mike Marsh
> acting Conservation Chair,
> Washington Native Plant Society

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

_______________________________________________
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology