On Tue, 2011-01-11 at 20:02 -0800, Mike Marsh wrote:
> All,
> I am trying to replicate figure 2 in Glenn De'Ath's 2002 report, Ecology
> 83:1105-1117.
> The figure caption says that data were species standardized and site
> standardized to the same mean, and that euclidean distance was used in
> the analysis.
> Here is my script, using the "spider" data in the mvpart package:
>
> I tried two different standardizing functions, which do not produce the
> same scaling result:
I see what is happening. Glenn isn't clear about what he means by
standardising to the same mean. I initially took that to mean centring to
mean 0. What he appears to mean is standardising to a mean of 1. The
reason your first version *didn't* work is that you did effectively this:

    ## MVpart Q
    require(mvpart)
    data(spider)
    spp <- data.matrix(spider[, 1:12])
    env <- spider[, -c(1:12)]
    mvpart(spp ~ ., data = env, xv = "pick")
    set.seed(1)
    colMu <- colMeans(spp)
    spp2 <- sweep(spp, 2, colMu, "-")
    rowMu <- rowMeans(spp2)
    spp2 <- sweep(spp2, 1, rowMu, "-")
    mvpart(spp2 ~ ., data = env)

Note the "-" in my sweep() calls. This is the default, so you didn't
include it. This gives mean-centred data with colMeans() and rowMeans()
of (effectively; this is a computer, remember) zero:

    > head(rowMeans(spp2))
    [1]  0.000000e+00  3.700743e-17 -2.312965e-17  9.251859e-18  7.401487e-17
    [6]  0.000000e+00
    > head(colMeans(spp2))
        arct.lute     pard.lugu     zora.spin     pard.nigr     pard.pull
    -6.938894e-18 -7.930164e-18 -2.379049e-17  5.179389e-17  1.982541e-17
        aulo.albi
    -5.055480e-17

What Glenn's scaler() function does is effectively this (for "mean1"):

    set.seed(1)
    colMu <- colMeans(spp)
    spp3 <- sweep(spp, 2, colMu, "/")
    rowMu <- rowMeans(spp3)
    spp3 <- sweep(spp3, 1, rowMu, "/")
    mvpart(spp3 ~ ., data = env)

And we note that this does give the same tree; well, similar, as this one
has an extra branch pruned off, but that just reflects the instability of
the CV results for such a small data set.

So why did you get a different result with scaler()? You just weren't
careful enough to apply scaler() to the *species* data only. You did:

    spider.std <- scaler(spider, col = "mean1", row = "mean1")

and spider contains both the env and species data, so the row means used
to centre the data include values of the env data, and hence the
difference. This is not what you want.
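To see the difference between the two sweep() versions concretely, here is a
small self-contained check in base R on a made-up matrix (the values are
arbitrary, purely for illustration):

```r
## Toy abundance-like matrix; values are arbitrary, for illustration only
set.seed(1)
m <- matrix(rpois(20, lambda = 5) + 1, nrow = 4)

## "-" version: centre columns, then rows, to mean 0
m0 <- sweep(m, 2, colMeans(m), "-")
m0 <- sweep(m0, 1, rowMeans(m0), "-")

## "/" version: standardise columns, then rows, to mean 1 ("mean1" style)
m1 <- sweep(m, 2, colMeans(m), "/")
m1 <- sweep(m1, 1, rowMeans(m1), "/")

## After the second sweep, row means are (up to floating point) 0 and 1
range(rowMeans(m0))
range(rowMeans(m1))
```

Note that only the *row* means come out exactly 0 or 1 after the second
sweep; the column means drift slightly from their target, which is why the
two procedures are genuinely different standardisations, not rescalings of
one another.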
This is probably what was intended (note mvpart() wants a matrix response,
hence the data.matrix() wrapper):

    spider.std <- scaler(spider[, 1:12], col = "mean1", row = "mean1")
    spider.mrt <- mvpart(data.matrix(spider.std) ~ herbs + reft + moss +
                         sand + twigs + water, data = spider)

> so I have not been able to reproduce De'Ath's result, and
> there are at least 3 different configurations of the regression tree for
> spider data, depending on how it is calculated.

I think Glenn's results are reproducible *if* you run multiple CVs (i.e.
run mvpart many times, aggregate the sizes of the selected trees, and see
which size is selected most often). But even then, the deviations are only
in whether lower branches are pruned or retained, *if* you apply the
standardisation used by Glenn (but not clearly stated).

> So, can I obtain consistent results using mvpart with my own data, in
> order to compare them to the clusters that I obtained with hclust?

Yes, *if* (and this is a big IF) you have a large data set, or a data set
with a sufficiently large signal-to-noise ratio to cancel out the noise
introduced by CV on a smaller data set. Trees really need 100s to 1000s of
observations to work properly, and even then you'd need a good
signal-to-noise ratio.

There is a problem with trees, and MRT does nothing to get round this;
namely one of stability. A slight change in the input data /can/ have
large implications for the tree grown. Note this is a different issue from
the one you report, which is reproducing someone else's analysis. The
point is whether the results are "consistent". We know trees are unstable
(have high variance). If you want to account for the effects of sampling
and reduce the variance of the fitted model, then bagging, random forests,
or even boosting are ways to go. Boosting is likely to do better than
either bagging or random forests because it tries to reduce the bias of
the model as well as the variance. The problem here, though, is that none
of the bagging, random forest, or boosting code I am aware of in R will
work with multivariate trees outright.
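MRT aside, the instability point is easy to demonstrate with plain rpart()
(on which Glenn's mvpart is based). A hedged sketch, using the univariate
kyphosis data that ships with rpart purely for illustration; the same idea
carries over to multivariate trees:

```r
## Sketch: how much does the fitted tree change under resampling?
## Plain rpart() and its bundled kyphosis data, for illustration only.
library(rpart)
data(kyphosis)

set.seed(42)
B <- 50
root_split <- character(B)
for (b in seq_len(B)) {
    ## bootstrap resample of the data
    boot <- kyphosis[sample(nrow(kyphosis), replace = TRUE), ]
    fit <- rpart(Kyphosis ~ Age + Number + Start, data = boot)
    ## fit$frame$var[1] is the variable split on at the root
    ## ("<leaf>" if the tree was not grown at all)
    root_split[b] <- as.character(fit$frame$var[1])
}

## If trees were stable, a single variable would dominate this table
table(root_split)
```

If even the root split moves around under resampling, everything below it
does too, which is exactly the high-variance behaviour that bagging and
friends are designed to average away.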
You *might* be able to use code in the party package, but it isn't fitting
the same type of trees. You *might* be able to use the ipred package and
the bagging() function contained therein, because it calls rpart() and
Glenn's package provides a modified version of rpart. But you will have to
make sure ipred can deal with multivariate responses, and check whether
you can pass all the required arguments through to mvpart. You may not
need the latter, but I would urge extreme caution in trying it, and make
sure you check it is doing the right thing; ipred knows nothing about
Glenn's modified rpart, so it is not guaranteed to work.

You could always do this by hand, but that will involve some coding effort
on your part; bootstrapping is easy, it is storing the results and then
aggregating over the ensemble of trees that is tricky, plus the
interpretation of the model afterwards, which will require variable
importance measures and partial plots. That is where things get more
involved...

HTH

G

> with thanks,
> Mike Marsh
> acting Conservation Chair,
> Washington Native Plant Society

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

_______________________________________________
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology