Re: [R-sig-phylo] Partitioning phylogentic and environment effect on trait distribution

2015-01-22 Thread Jarrod Hadfield

Hi,

You might try looking at:

A general and simple method for obtaining R2 from generalized linear  
mixed-effects models by Nakagawa  Schielzeth.


They detail some of the issues with comparing null versus full models,  
and it is a well presented paper.


I can't remember whether they deal with the fact that the standard  
estimator for the R^2 is upwardly biased, as is the variance explained  
by the fixed effects. It is easier to see this from a frequentist  
point of view:  imagine a covariate x with an estimated regression  
coefficient bhat. bhat = b + d where b is the true coefficient and d  
is the sampling deviation.


The variance explained by x is  VAR(x*b) = VAR(x)b^2 but  
E[VAR(x*bhat)] (where the expectation is taken over possible  
estimates, bhat) is
VAR(x)b^2+VAR(x)VAR(d). The estimated R^2 is upwardly biased by  
uncertainty in the fixed effects: VAR(x)VAR(d) is always positive. Its  
not that surprising: under least squares bhat is the value that  
explains most of the variance in the data: if b differs from bhat, b  
must explain less.


Cheers,

Jarrod

The derivation for the above is:

 E[VAR(x*bhat)] = E[VAR(x*(b+d))]
= E[VAR(x)(b+d)^2]
= E[VAR(x)(b^2+2bd+d^2)]
= VAR(x)(b^2+VAR(d))

d^2 = VAR(d) and E[2bd]=0 for unbiased estimators because E[d]=0.















Quoting Gustaf Granath gustaf.gran...@gmail.com on Thu, 22 Jan 2015  
13:17:58 +0100:



Hi
Va/(Va+Ve) is the proportion of variance explained by phylogeny  
*after* accounting for the fixed effects. To include the variance  
explained by the fixed effects in the denominator is a bit  
difficult, because you have to  assume something about the  
distribution of predictors. For example do you want the different  
habitat types to be equally represented, or at the frequency with  
which they were sampled?
I see. I thought I could just look at the change in Va with or  
without the fixed effect (full versus null model). The biological  
question is if the observed trait variation is mainly a result of  
phylogeny or the environment, hence the focus on variance explained  
by the different components. I must admit that im not following why  
the distribution of the predictor is important (or what kind of  
approach you are referring to), but I guess that frequency with  
which they were sampled is most appropriate for the question. This  
should be easier for a continuous predictor, right?


Cheers

Gustaf

On 2015-01-22 12:23, Jarrod Hadfield wrote:

Hi Gustaf,

In the model with just species the residual variation is  
measurement error/plasticity error, but could also include  
deviations from the assumed BM process. If you add species.ide that  
term captures deviations from the assumed BM process.


Va/(Va+Ve) is the proportion of variance explained by phylogeny  
*after* accounting for the fixed effects. To include the variance  
explained by the fixed effects in the denominator is a bit  
difficult, because you have to  assume something about the  
distribution of predictors. For example do you want the different  
habitat types to be equally represented, or at the frequency with  
which they were sampled?


Cheers,

Jarrod


Quoting Gustaf Granath gustaf.gran...@gmail.com on Thu, 22 Jan  
2015 11:11:02 +0100:



Jarrod,
Thanks for your input. Dropping species.ide result in a model that  
is estimated without any problems.
prior = list(R = list(V = 1, nu = 0), G = list(G1 = list(V = 1, nu  
= 1, alpha.mu = 0, alpha.V = 1000)))

m1.mcmc - MCMCglmm(y ~ habitat, random =~ species,
   data = traits, ginverse = list(species = Ainv),  
nitt = 70, thin=100, prior = prior)
It seems to be rather robust (not sensitive to priors) and Im not  
sure I understand why you would need 100s of species.


But how is this model related to the simpler approach I described  
in my first post?
The error term (Ve) is the within species error  
(measurement/plasticity error) and the species variance (Va) is  
the phylogenetic variance in my data. Lambda (Pagel's) is then  
Va/(Va+Ve).
(m1.mcmc$VCV[,species] / (m1.mcmc$VCV[,species] +  
m1.mcmc$VCV[,units]))

# BUT is this only correct under a model without habitat?

If habitat is included in the model the between species variation  
not explained by phylogeny or habitat is included in Ve because I  
dropped the speceis.ide, or am I missing something here?


If I run a model with and without habitat I can extract the amount  
of variation explained by habitat (differences in Ve = Vhab) and  
get a ballpark number on the relative contribution to the pattern  
we observe:

Va / (Va+Vhab+Ve) #phylo
Vhab / (Va+Vhab+Ve) #habitat
Ve / (Va+Vhab+Ve) #measurement/plasticity/local adaption  and  
other processes


Did I get that right or am I lost?

Gustaf


On 2015-01-22 04:54, Jarrod Hadfield wrote:

Hi Gustaf,

1/ You can ignore nhabitat: for some reason I thought you thought  
that there might be systematic differences between 

Re: [R-sig-phylo] Partitioning phylogentic and environment effect on trait distribution

2015-01-22 Thread Jarrod Hadfield

Hi Gustaf,

In the model with just species the residual variation is measurement  
error/plasticity error, but could also include deviations from the  
assumed BM process. If you add species.ide that term captures  
deviations from the assumed BM process.


Va/(Va+Ve) is the proportion of variance explained by phylogeny  
*after* accounting for the fixed effects. To include the variance  
explained by the fixed effects in the denominator is a bit difficult,  
because you have to  assume something about the distribution of  
predictors. For example do you want the different habitat types to be  
equally represented, or at the frequency with which they were sampled?


Cheers,

Jarrod






Quoting Gustaf Granath gustaf.gran...@gmail.com on Thu, 22 Jan 2015  
11:11:02 +0100:



Jarrod,
Thanks for your input. Dropping species.ide result in a model that  
is estimated without any problems.
prior = list(R = list(V = 1, nu = 0), G = list(G1 = list(V = 1, nu =  
1, alpha.mu = 0, alpha.V = 1000)))

m1.mcmc - MCMCglmm(y ~ habitat, random =~ species,
data = traits, ginverse = list(species = Ainv),  
nitt = 70, thin=100, prior = prior)
It seems to be rather robust (not sensitive to priors) and Im not  
sure I understand why you would need 100s of species.


But how is this model related to the simpler approach I described in  
my first post?
The error term (Ve) is the within species error  
(measurement/plasticity error) and the species variance (Va) is the  
phylogenetic variance in my data. Lambda (Pagel's) is then Va/(Va+Ve).

(m1.mcmc$VCV[,species] / (m1.mcmc$VCV[,species] + m1.mcmc$VCV[,units]))
# BUT is this only correct under a model without habitat?

If habitat is included in the model the between species variation  
not explained by phylogeny or habitat is included in Ve because I  
dropped the speceis.ide, or am I missing something here?


If I run a model with and without habitat I can extract the amount  
of variation explained by habitat (differences in Ve = Vhab) and get  
a ballpark number on the relative contribution to the pattern we  
observe:

Va / (Va+Vhab+Ve) #phylo
Vhab / (Va+Vhab+Ve) #habitat
Ve / (Va+Vhab+Ve) #measurement/plasticity/local adaption  and  
other processes


Did I get that right or am I lost?

Gustaf


On 2015-01-22 04:54, Jarrod Hadfield wrote:

Hi Gustaf,

1/ You can ignore nhabitat: for some reason I thought you thought  
that there might be systematic differences between specialists and  
non-specialists.


2/ species and species.ide are identical, but species effects are  
structured by the phylogeny due to the ginverse specification.  
species.ide are treated as iid.


3/ With few species and few habitats, and very few species observed  
in multiple habitats, I think you will just have to go for the  
simple model you describe (perhaps even dropping species.ide):


m1.mcmc-MCMCglmm(trait~as.factor(habitat)+x, random =~  
species+species.ide, data=my_data, ginverse=list(species=Ainv))


Expect the variance components to be poorly estimated and prior  
sensitive. 100's of species are required to get even moderately  
precise estimates of the phylogenetic variance.


Cheers,

Jarrod



Quoting Gustaf Granath gustaf.gran...@gmail.com on Thu, 22 Jan  
2015 01:15:23 +0100:



Jarrod,
I tried this but Im not sure it works well for my problem and/or  
Im doing something wrong.


First, Im not sure what nhabitat is. A vector specifying the  
number of habitats that the species is found in? I.e.  
my_data$nhabitat may look like:
1,1,1,1,2,2,2,2, 2,2,2,2  if the first two species occurs in  
one and two habitats, respectively.


Second, species.ide. If it is the same as species, why not just: +  
species + species ?

Should it be a vector identical to species but just names species.ide?

So to my data, there are only three habitats (3 levels), thus Im  
not sure it works well modeling it as a random factor (and as you  
indicate we need a reasonable number of habitats). There are 2,  
2 and 11 species in each habitat, and only a few are found in  
multiple habitats. So Im really looking at simpler model because  
there are not enough data for complicated structures. With so few  
species things are pushed to the limit but my guess is that the  
main problem here is that habitat cannot be fitted as a random  
factor. Possible to model it as a fixed effect?


However, there are other predictors of interest as well,  
continuous variables. How would that be modeled then?

Like this (predictor=x)?
m1.mcmc-MCMCglmm(trait~ x, random =~ species+species.ide,  
data=my_data, ginverse=list(species=Ainv))


I would also like to thank Tony for recommending excellent papers  
and I didnt know about the pez package which looks promising.


Cheers

Gustaf


On 2015-01-21 16:26, Jarrod Hadfield wrote:

Dear Gustaf,

How many levels of `habitat' are there, and are they  
cross-classified with respect to species (i.e. are multiple  
species measured in the same 

Re: [R-sig-phylo] Partitioning phylogentic and environment effect on trait distribution

2015-01-22 Thread Gustaf Granath

Jarrod,
Thanks for your input. Dropping species.ide result in a model that is 
estimated without any problems.
prior = list(R = list(V = 1, nu = 0), G = list(G1 = list(V = 1, nu = 1, 
alpha.mu = 0, alpha.V = 1000)))

m1.mcmc - MCMCglmm(y ~ habitat, random =~ species,
data = traits, ginverse = list(species = Ainv), 
nitt = 70, thin=100, prior = prior)
It seems to be rather robust (not sensitive to priors) and Im not sure I 
understand why you would need 100s of species.


But how is this model related to the simpler approach I described in my 
first post?
The error term (Ve) is the within species error (measurement/plasticity 
error) and the species variance (Va) is the phylogenetic variance in my 
data. Lambda (Pagel's) is then Va/(Va+Ve).
(m1.mcmc$VCV[,species] / (m1.mcmc$VCV[,species] + 
m1.mcmc$VCV[,units]))

# BUT is this only correct under a model without habitat?

If habitat is included in the model the between species variation not 
explained by phylogeny or habitat is included in Ve because I dropped 
the speceis.ide, or am I missing something here?


If I run a model with and without habitat I can extract the amount of 
variation explained by habitat (differences in Ve = Vhab) and get a 
ballpark number on the relative contribution to the pattern we observe:

Va / (Va+Vhab+Ve) #phylo
Vhab / (Va+Vhab+Ve) #habitat
Ve / (Va+Vhab+Ve) #measurement/plasticity/local adaption  and other 
processes


Did I get that right or am I lost?

Gustaf


On 2015-01-22 04:54, Jarrod Hadfield wrote:

Hi Gustaf,

1/ You can ignore nhabitat: for some reason I thought you thought that 
there might be systematic differences between specialists and 
non-specialists.


2/ species and species.ide are identical, but species effects are 
structured by the phylogeny due to the ginverse specification. 
species.ide are treated as iid.


3/ With few species and few habitats, and very few species observed in 
multiple habitats, I think you will just have to go for the simple 
model you describe (perhaps even dropping species.ide):


m1.mcmc-MCMCglmm(trait~as.factor(habitat)+x, random =~ 
species+species.ide, data=my_data, ginverse=list(species=Ainv))


Expect the variance components to be poorly estimated and prior 
sensitive. 100's of species are required to get even moderately 
precise estimates of the phylogenetic variance.


Cheers,

Jarrod



Quoting Gustaf Granath gustaf.gran...@gmail.com on Thu, 22 Jan 2015 
01:15:23 +0100:



Jarrod,
I tried this but Im not sure it works well for my problem and/or Im 
doing something wrong.


First, Im not sure what nhabitat is. A vector specifying the number 
of habitats that the species is found in? I.e. my_data$nhabitat may 
look like:
1,1,1,1,2,2,2,2, 2,2,2,2  if the first two species occurs in one 
and two habitats, respectively.


Second, species.ide. If it is the same as species, why not just: + 
species + species ?

Should it be a vector identical to species but just names species.ide?

So to my data, there are only three habitats (3 levels), thus Im not 
sure it works well modeling it as a random factor (and as you 
indicate we need a reasonable number of habitats). There are 2, 2 
and 11 species in each habitat, and only a few are found in multiple 
habitats. So Im really looking at simpler model because there are not 
enough data for complicated structures. With so few species things 
are pushed to the limit but my guess is that the main problem here is 
that habitat cannot be fitted as a random factor. Possible to model 
it as a fixed effect?


However, there are other predictors of interest as well, continuous 
variables. How would that be modeled then?

Like this (predictor=x)?
m1.mcmc-MCMCglmm(trait~ x, random =~ species+species.ide, 
data=my_data, ginverse=list(species=Ainv))


I would also like to thank Tony for recommending excellent papers and 
I didnt know about the pez package which looks promising.


Cheers

Gustaf


On 2015-01-21 16:26, Jarrod Hadfield wrote:

Dear Gustaf,

How many levels of `habitat' are there, and are they 
cross-classified with respect to species (i.e. are multiple species 
measured in the same habitat)?


Assuming for now there are a reasonable number of habitats then the 
simplest model (without cross-classification) in asreml/MCMCglmm 
would be


Ainv-inverseA(my_tree)$Ainv

m1.asreml-asreml(trait~nhabitat, random =~ 
habitat+giv(species)+species.ide, data=my_data, 
ginverse=list(species=sm2asreml(Ainv)))


m1.mcmc-MCMCglmm(trait~nhabitat, random =~ 
habitat+species+species.ide, data=my_data, ginverse=list(species=Ainv))


nhabitat is the number of habitats a species is found in. 
species.ide is the same as species and models the between species 
variation not captured by the BM model. Note that this requires you 
using the individual data rather than the species means (which is to 
be preferred).


With cross-classification you may want to have the effects of the 
same habitat vary across species. There's many