Re: [R-sig-phylo] combining p-values in PCMs

2013-04-02 Thread Liam J. Revell

Hi Vladimir.

I'm not clear on the source of your misunderstanding, so let me do my best.

1) I think that's right is casual shorthand in English. In this case I 
just mean that I think that this method will give us the correct 
variance on beta (but I'm not sure - hence 'I think'). It certainly 
seems better than computing the variance on the estimator from the 
single best tree or consensus tree, which ignores phylogenetic 
uncertainty; or from the variance in beta across trees, which ignores 
sampling error in the estimation of beta on any tree. It seems unlikely 
that this would dramatically underestimate the variance in beta.


2) The null hypothesis here is the slope of our fitted model (beta) is 
zero, but I don't think we need to focus specifically on null hypothesis 
testing. What we're really interested in is getting a true (or at 
least better) estimate of the uncertainty in the model coefficient that 
takes into account the different sources of error in our data  tree. I 
just suggested a t-test because people are often interested in getting a 
p-value from their fitted model and the p-value from a single consensus 
or best tree will surely be too small.


I hope that seems reasonable to you.

All the best, Liam

Liam J. Revell, Assistant Professor of Biology
University of Massachusetts Boston
web: http://faculty.umb.edu/liam.revell/
email: liam.rev...@umb.edu
blog: http://blog.phytools.org

On 4/2/2013 7:27 PM, Vladimir Minin wrote:

Hi Liam,

When you say I think that's right, I am not sure I understand what you
mean. In particular, what is your null hypothesis stated mathematically
and what frequentist guarantees do you have for the p-value you are
computing? Typically, we desire the uniform distribution of p-values
under the null. This would be easy to check by simulation, but this
simulation would require to generate both sequence and non-sequence
trait data, in other words we would need to decide what exactly the null
hypothesis is here.

I am suspicious of your approach, because, in general, Neyman-Pearson
hypothesis testing framework has hard time accommodating uncertainty in
nuisance parameters (phylogenies in this case).

Unfortunately, I don't know if a rigorous answer to the original
question posed by John exists, so I cannot suggest a practical
alternative other than building a large Bayesian model that encompasses
uncertainty about all its parameters. In that case, by the way, testing
hypotheses still remains infeasible, because Bayesian framework lacks
tools for testing model fit in the absence of an alternative (posterior
predictive model checks are exceptions, but their
interpretation/calibration is questionable anyway), meaning that
Bayesians can only compare models, but cannot examine the fit of one
(null) model in isolation.

-Vladimir

Vladimir Minin

Assistant Professor
Department of Statistics
University of Washington, Seattle
Padelford Hall C-315, Box 354322
Seattle, WA 98195-4322
telephone: 206-543-4302
web: www.stat.washington.edu/vminin http://www.stat.washington.edu/vminin


On Apr 1, 2013, at 9:04 AM, Liam J. Revell wrote:


Hi John  list.

Can I add to this by suggesting (in case it is not clear from Joe's
comment) that we should probably combine the variance among estimates
obtained from different trees in the posterior sample with the
sampling variance of any single estimate?

One fairly sensible way to do this is to compute the variance due to
phylogenetic uncertainty as the variance among estimates obtained from
the trees of the posterior sample; and then to compute the sampling
variance as the mean variance of the estimator from each tree; and
then add the two variances them. The standard error of our estimate
(computed as the mean across trees) is the square-root of this
variance. To conduct a hypothesis test on the regression coefficient,
then, you would compute the mean across trees and then the ratio of
the parameter and its standard error should have a t-distribution with
n-2 degrees of freedom for n taxa (not contrasts).

To do this from a practical perspective from trees in 'multiPhylo'
object (here trees) and data in x  y for a simple bivariate
regression, we do the following:

# first define the following custom function
ff-function(tree,x,y){
pic.x-pic(x,tree)
pic.y-pic(y,tree)
fit-lm(pic.y~pic.x-1)
setNames(c(coef(fit),vcov(fit)),c(beta,var(beta)))
}
# now apply to all trees in your sample
BB-t(sapply(trees,ff,x,y))
# total variance in beta estimated by
varBeta-var(BB[,beta])+mean(BB[,var(beta)])
t.beta-mean(BB[,beta])/sqrt(varBeta)
P.beta-2*pt(abs(t.beta),df=length(trees[[1]]$tip)-2,lower.tail=FALSE)

I think that's right.

All the best, Liam

Liam J. Revell, Assistant Professor of Biology
University of Massachusetts Boston
web: http://faculty.umb.edu/liam.revell/
email: liam.rev...@umb.edu mailto:liam.rev...@umb.edu
blog: http://blog.phytools.org

On 4/1/2013 11:10 AM, Joe Felsenstein wrote:


John D asked:


Given that things have been quiet in 

Re: [R-sig-phylo] combining p-values in PCMs

2013-04-02 Thread Mark Holder
On Mon, Apr 1, 2013 at 10:10 AM, Joe Felsenstein j...@gs.washington.edu wrote:

 John D asked:
[snip]
 Imagine that I run a PIC analysis on two traits using 1000
 post-burn-in trees. What would be the best way to summarize these
 results? Average p-values across all analyses?

Joe's response
 [snip] simply
 ask whether the credible interval for the parameter includes
 zero. In this case that interval would be the upper 95% of the
 sample values.

 Bringing in P values or Fisher's test just mixes Bayesian with
 frequentist or likelihood methods, which will get two groups of
 people annoyed at you
[snip]

Hi all,
  I certainly agree with Joe's point about the mixture of frequentist
and Bayesian
perspectives inherent in calculating P values from trees from the
posterior.  As
Vladimir pointed out, it really is not easy to calculate P values
correctly when there
are nuisance parameters. I was recently reading up on this a bit, so I
thought I'd
pass on a reference (though I'm embarrassed to admit that I only learned of it
recently). A key paper seems to be:

Berger, Roger L., and Dennis D. Boos. P values maximized over a
confidence set
for the nuisance parameter. Journal of the American Statistical Association
89.427 (1994): 1012-1016.


Briefly, they recommend:

1. calculate a (1-beta)*100% confidence set for the nuisance
parameter (so beta= 0.01 for a 99% confidence interval on
trees, for example),

2. calculate the P value for every value of the nuisance parameter
in the set

3. report the P value for the test as:
P = beta + the largest P value observed


Note that there is the difference between the confidence set in the Berger
and Boos approach and what the credible set that you'd get out of a Bayesian
MCMC run (as Joe pointed out).

Also note that the P value from Berger and Boos could be a lot higher than
the average P value over the credible set (if your test statistic is close to
being pivotal then there won't be much variability in the P value
from one tree
to another, so the mean the max won't be that different). The P value
will always
be higher than the mean P value by at least the value that you choose for beta.

Reporting such a high P value seems frustrating, but it is in line
with the frequentist
tradition of considering the least favorable configuration of the null
- the null that is
hardest to reject.

It is difficult to calculate a something like a 99% confidence set for
trees, so I'm not
aware of anyone using the Berger and Boos method in phylogenetics.

Plus, dealing with uncertainty in the branch lengths is another kettle of fish.

It seems like the alternative to the Berger and Boos recipe is some
form of parametric
bootstrapping. You may also be interested in:

  Demortier, Luc. P-values and nuisance parameters. Proceedings of
PHYSTAT 2007 (2007): 23-33.


all the best,
Mark
-- 
Mark Holder

mthol...@gmail.com
mthol...@ku.edu
http://phylo.bio.ku.edu/mark-holder

==
Department of Ecology and Evolutionary Biology
University of Kansas
6031 Haworth Hall
1200 Sunnyside Avenue
Lawrence, Kansas 66045

lab phone:  785.864.5789
fax (shared): 785.864.5860

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/