Re: [R-sig-phylo] combining p-values in PCMs

Liam J. Revell Tue, 02 Apr 2013 16:58:24 -0700

Hi Vladimir.

I'm not clear on the source of your misunderstanding, so let me do my best.

1) "I think that's right" is casual shorthand in English. In this case Ijust mean that I think that this method will give us the correctvariance on beta (but I'm not sure - hence 'I think'). It certainlyseems better than computing the variance on the estimator from thesingle best tree or consensus tree, which ignores phylogeneticuncertainty; or from the variance in beta across trees, which ignoressampling error in the estimation of beta on any tree. It seems unlikelythat this would dramatically underestimate the variance in beta.

2) The null hypothesis here is the slope of our fitted model (beta) iszero, but I don't think we need to focus specifically on null hypothesistesting. What we're really interested in is getting a "true" (or atleast better) estimate of the uncertainty in the model coefficient thattakes into account the different sources of error in our data & tree. Ijust suggested a t-test because people are often interested in getting ap-value from their fitted model and the p-value from a single consensusor "best" tree will surely be too small.


I hope that seems reasonable to you.

All the best, Liam

Liam J. Revell, Assistant Professor of Biology
University of Massachusetts Boston
web: http://faculty.umb.edu/liam.revell/
email: liam.rev...@umb.edu
blog: http://blog.phytools.org

On 4/2/2013 7:27 PM, Vladimir Minin wrote:

Hi Liam,

When you say "I think that's right," I am not sure I understand what you
mean. In particular, what is your null hypothesis stated mathematically
and what frequentist guarantees do you have for the p-value you are
computing? Typically, we desire the uniform distribution of p-values
under the null. This would be easy to check by simulation, but this
simulation would require to generate both sequence and non-sequence
trait data, in other words we would need to decide what exactly the null
hypothesis is here.

I am suspicious of your approach, because, in general, Neyman-Pearson
hypothesis testing framework has hard time accommodating uncertainty in
nuisance parameters (phylogenies in this case).

Unfortunately, I don't know if a rigorous answer to the original
question posed by John exists, so I cannot suggest a practical
alternative other than building a large Bayesian model that encompasses
uncertainty about all its parameters. In that case, by the way, testing
hypotheses still remains infeasible, because Bayesian framework lacks
tools for testing model fit in the absence of an alternative (posterior
predictive model checks are exceptions, but their
interpretation/calibration is questionable anyway), meaning that
Bayesians can only compare models, but cannot examine the fit of one
(null) model in isolation.

-Vladimir

Vladimir Minin

Assistant Professor
Department of Statistics
University of Washington, Seattle
Padelford Hall C-315, Box 354322
Seattle, WA 98195-4322
telephone: 206-543-4302
web: www.stat.washington.edu/vminin <http://www.stat.washington.edu/vminin>


On Apr 1, 2013, at 9:04 AM, Liam J. Revell wrote:

Hi John & list.

Can I add to this by suggesting (in case it is not clear from Joe's
comment) that we should probably combine the variance among estimates
obtained from different trees in the posterior sample with the
sampling variance of any single estimate?

One fairly sensible way to do this is to compute the variance due to
phylogenetic uncertainty as the variance among estimates obtained from
the trees of the posterior sample; and then to compute the sampling
variance as the mean variance of the estimator from each tree; and
then add the two variances them. The standard error of our estimate
(computed as the mean across trees) is the square-root of this
variance. To conduct a hypothesis test on the regression coefficient,
then, you would compute the mean across trees and then the ratio of
the parameter and its standard error should have a t-distribution with
n-2 degrees of freedom for n taxa (not contrasts).

To do this from a practical perspective from trees in 'multiPhylo'
object (here "trees") and data in x & y for a simple bivariate
regression, we do the following:

# first define the following custom function
ff<-function(tree,x,y){
pic.x<-pic(x,tree)
pic.y<-pic(y,tree)
fit<-lm(pic.y~pic.x-1)
setNames(c(coef(fit),vcov(fit)),c("beta","var(beta)"))
}
# now apply to all trees in your sample
BB<-t(sapply(trees,ff,x,y))
# total variance in beta estimated by
varBeta<-var(BB[,"beta"])+mean(BB[,"var(beta)"])
t.beta<-mean(BB[,"beta"])/sqrt(varBeta)
P.beta<-2*pt(abs(t.beta),df=length(trees[[1]]$tip)-2,lower.tail=FALSE)

I think that's right.

All the best, Liam

Liam J. Revell, Assistant Professor of Biology
University of Massachusetts Boston
web: http://faculty.umb.edu/liam.revell/
email: liam.rev...@umb.edu <mailto:liam.rev...@umb.edu>
blog: http://blog.phytools.org

On 4/1/2013 11:10 AM, Joe Felsenstein wrote:


John D asked:

Given that things have been quiet in the list lately, I think this
could be a good time for me to ask your opinion about this issue.

Imagine that I run a PIC analysis on two traits using 1000
post-burn-in trees. What would be the best way to summarize these
results? Average p-values across all analyses? Perhaps a specific
method to combine the resulting probabilities e.g. Fisher's test?


If you want to know about a regression coefficient, or other
parameter, the 1000 inferred values in the 1000 trees can be
taken as a posterior distribution if the trees come from a
Bayesian inference (I am assuming from your description that
they do).

If you want to test whether, say, the regression coefficient is
positive, you can take this posterior distribution and simply
ask whether the credible interval for the parameter includes
zero. In this case that interval would be the upper 95% of the
sample values.

Bringing in P values or Fisher's test just mixes Bayesian with
frequentist or likelihood methods, which will get two groups of
people annoyed at you, whereas using only one of these methods
has the advantage of irritating only one group of us.


I note that John's email address is "dobzhanski". Good to hear
from you again, Professor Dobzhansky. You may recall that you
gave me a tour of your lab at Rockefeller University in the
summer of 1963 when Dick Lewontin sent me from Rochester to
pick up some data paperwork for him for a joint project the two
of you were working on.

(Okay, the last part is "Poisson d'Avrile")

----
Joe Felsenstein, j...@gs.washington.edu <mailto:j...@gs.washington.edu>
 Dept. of Genome Sciences, Univ. of Washington
 Box 355065, Seattle, WA 98195-5065 USA

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
<mailto:R-sig-phylo@r-project.org>
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
http://www.mail-archive.com/r-sig-phylo@r-project.org/


_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
<mailto:R-sig-phylo@r-project.org>
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
http://www.mail-archive.com/r-sig-phylo@r-project.org/


_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] combining p-values in PCMs

Reply via email to