Re: [R-sig-phylo] Model-Selection vs. Finding Models that Fit Well
I agree with Dave here. White noise has two parameters, mean and variance, and - to me - is an interesting model to test. But I'm not sure it should be considered as a baseline. One can link Brownian motion and white noise through the Ornstein-Uhlenbeck model - BM is OU with alpha (constraint) parameter equal to zero, and WN is OU with infinite alpha. lh On Jan 30, 2011, at 3:18 PM, David Bapst wrote: Florian- Doesn't white noise have two parameters, mean and variance, and thus is just as complex as the Brownian Motion model? I guess LRT could be done against both. That said, it isn't clear to me that what it means for the White Noise to fit best, as WN is interpretable as an evolutionary scenario, so it doesn't seem a very clean null. If WN fits best in a set of models, the trait examined must be exceedingly low in phylogenetic signal and thus the trait must be evolving under one of the scenarios which can produce low signal (OU with high rates and strong attraction, and/or evolution where descendant values are not a function of ancestral values). -Dave On Fri, Jan 28, 2011 at 5:31 PM, Florian Boucher floflobouc...@gmail.comwrote: Hi David and list, just a quick comment on one of your questions : for quantitative traits on a phylogeny you can compare your best model to the white noise model implemented in geiger, which assumes that your traits are drawn from a normal distribution. This last model would be the baseline model Ted evoked in his post. I hope it helps... Florian Boucher PhD student, Laboratoire d'Ecologie Alpine, Grenoble, France 2011/1/28 David Bapst dwba...@uchicago.edu Hello all, Apologies for leaving the replies to get cold for a week, but now I finally have some time to respond. On Thu, Jan 20, 2011 at 12:17 PM, Brian O'Meara omeara.br...@gmail.com wrote: I think considering model adequacy is something that would be useful to do and is not done much now. One general way to do this is to simulate under your chosen model and see if the real data look very different from the simulated data. For example, I might try a one rate vs. a two rate Brownian motion model and find the latter fits better. If the actual true model is an OU model with two very distinct peaks and strong selection, which is not in my model set, I'll find that my simulated data under the two rate Brownian model may look very different from my actual data, which will be fairly bimodal. Aha, my model is inadequate. [but then what -- keep adding new models, just report that your model is inadequate, ...?] Certainly, data exploration is a step that cannot be skipped; I've found that Ackerly's traitgrams work well for me for visualizing my data, although I know some people who find them simply confusing (particularly my traitgrams, as they have fossil taxa all over them). Of course, you need a method for evaluating how similar the data look. There's been some work on this in models for tree inference using posterior predictive performance (work by Jonathan Bollback and Jeremy Brown come to mind) or using other approaches (some of Peter Waddell's work), but it hasn't really taken off yet. It'd be easy to implement such approaches in R for comparative methods given capabilities in ape, Geiger, and other packages. I don't entirely follow, but I'll look in to the posterior predictive performance work you mention. But wouldn't how 'similar the data look' would all be a function of what we want to look at, would it not? On Thu, Jan 20, 2011 at 12:27 PM, tgarl...@ucr.edu wrote: One quick comment. In many cases what you can do is also fit a model with no independent variables. It, too, will have a likelihood, and can be used as a baseline to argue whether your best model is actually any good. You could then do a maximum likelihood ratio test of your best model versus your baseline model. If the baseline model does not have a significant lack of fit by a LRT (e.g., P not 0.05), then your best model arguably isn't of much use. Cheers, Ted That seems like a pretty good idea! How would we go about doing that an example case, such as for a continuous trait on a phylogeny? On Thu, Jan 20, 2011 at 2:00 PM, Nick Matzke mat...@berkeley.edu wrote: If one is interested in absolute goodness of fit, rather than model comparison (which model fits best, which might not be useful if you are worried that all your models are horrible), wouldn't cross-validation be a good technique? I.e. leave out one tip, calculate the model and the estimated node values from the rest, then put an estimate and uncertainty on the tip and see how often it matches the observed value. Repeat for all observations... Would jack-knifing data like that be appropriate for the continuous trait analyses? That would seem to deal with whether we have enough data to distinguish the models at hand (which is still important), but not
Re: [R-sig-phylo] Model-Selection vs. Finding Models that Fit Well
I don't find the white noise to be any good evolutionary scenario: it's nowhere continuous. It just reduces to the assumption of normal, independent observations at the tips. Nothing fancy, then :) Cecile. On 01/31/11 11:53, Luke Harmon wrote: I agree with Dave here. White noise has two parameters, mean and variance, and - to me - is an interesting model to test. But I'm not sure it should be considered as a baseline. One can link Brownian motion and white noise through the Ornstein-Uhlenbeck model - BM is OU with alpha (constraint) parameter equal to zero, and WN is OU with infinite alpha. lh On Jan 30, 2011, at 3:18 PM, David Bapst wrote: ... -- Cecile Ane Assistant Professor, University of Wisconsin - Madison www.stat.wisc.edu/~ane/ CALS statistical consulting lab: www.cals.wisc.edu/calslab/stat.html ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] Model-Selection vs. Finding Models that Fit Well
Hi Joe et al., One point for clarification and your further thoughts. The way parameterize the OU process in Lavin et al. (2008) it is a value of zero (not infinity) that gives a start phylogeny with contemporaneous tips. Sometimes the ML estimate of d (what we call it) goes to zero, but more often it may go very small but not zero. In either case, it seems to me you can do a LRT versus a star with one d.f. Cheers, Ted Original message Date: Mon, 31 Jan 2011 13:02:38 -0800 From: Joe Felsenstein j...@gs.washington.edu Subject: Re: [R-sig-phylo] Model-Selection vs. Finding Models that Fit Well To: David Bapst dwba...@uchicago.edu Cc: R Sig Phylo Listserv r-sig-phylo@r-project.org, Ted Garland tgarl...@ucr.edu David Bapst and Cecile Ane noted that On Mon, Jan 31, 2011 at 12:45 PM, Cecile Ane a...@stat.wisc.edu wrote: I don't find the white noise to be any good evolutionary scenario: it's nowhere continuous. It just reduces to the assumption of normal, independent observations at the tips. Nothing fancy, then :) I think it is a bit problematic that WN could fit equally well to a situation with zero heritability and an OU1 scenario with infinite alpha. I guess one could check if there is any relationship between the magnitude of evolutionary change and distance from the mean (as it is generally expected under OU that changes are more minor closer to the optima). But perhaps even that relationship goes away as alpha approaches infinity. It would also be difficult to assess, because phylogenetic signal would be too low for ancestral state reconstruction to work with confidence. Lack of continuity does not worry me: if we have (say) discrete generations as in annual plants, nothing is continuous anyway. And our observations are rarely closely-enough spaced to be sensitive to that issue anyway. The problem with WN (i.e. tips i.i.d. and from multivariate normal) is that, although it connects to OU and to other models, in each case that is in a limit as some parameter goes off to infinity. In those cases I think WN as a null hypothesis is going to be difficult, as it will not be contained within the alternative hypotheses in a way that allows a Likelihood Ratio Test. And I suspect that it would give trouble also in a Bayesian framework, though that is just a suspicion. Joe Joe Felsenstein j...@gs.washington.edu Department of Genome Sciences and Department of Biology, University of Washington, Box 355065, Seattle, WA 98195-5065 USA [[alternative HTML version deleted]] ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] Model-Selection vs. Finding Models that Fit Well
Ted said -- One point for clarification and your further thoughts. The way parameterize the OU process in Lavin et al. (2008) it is a value of zero (not infinity) that gives a start phylogeny with contemporaneous tips. Sometimes the ML estimate of d (what we call it) goes to zero, but more often it may go very small but not zero. In either case, it seems to me you can do a LRT versus a star with one d.f. Can you? Is the value of zero contained within the range of possible values, or at its extreme? I think for LRT you need contained within. Joe Joe Felsenstein j...@gs.washington.edu Department of Genome Sciences and Department of Biology, University of Washington, Box 355065, Seattle, WA 98195-5065 USA ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] Model-Selection vs. Finding Models that Fit Well
The link between BM and WN is even closer than that: WN is the derivative of a BM process. Now, BM is nowhere differentiable, so in the usual sense, WN doesn't really exist. However, it can be approximated by simulation. Cheers, Simon. On 01/02/11 03:53, Luke Harmon wrote: I agree with Dave here. White noise has two parameters, mean and variance, and - to me - is an interesting model to test. But I'm not sure it should be considered as a baseline. One can link Brownian motion and white noise through the Ornstein-Uhlenbeck model - BM is OU with alpha (constraint) parameter equal to zero, and WN is OU with infinite alpha. lh On Jan 30, 2011, at 3:18 PM, David Bapst wrote: Florian- Doesn't white noise have two parameters, mean and variance, and thus is just as complex as the Brownian Motion model? I guess LRT could be done against both. That said, it isn't clear to me that what it means for the White Noise to fit best, as WN is interpretable as an evolutionary scenario, so it doesn't seem a very clean null. If WN fits best in a set of models, the trait examined must be exceedingly low in phylogenetic signal and thus the trait must be evolving under one of the scenarios which can produce low signal (OU with high rates and strong attraction, and/or evolution where descendant values are not a function of ancestral values). -Dave On Fri, Jan 28, 2011 at 5:31 PM, Florian Boucherfloflobouc...@gmail.comwrote: Hi David and list, just a quick comment on one of your questions : for quantitative traits on a phylogeny you can compare your best model to the white noise model implemented in geiger, which assumes that your traits are drawn from a normal distribution. This last model would be the baseline model Ted evoked in his post. I hope it helps... Florian Boucher PhD student, Laboratoire d'Ecologie Alpine, Grenoble, France 2011/1/28 David Bapstdwba...@uchicago.edu Hello all, Apologies for leaving the replies to get cold for a week, but now I finally have some time to respond. On Thu, Jan 20, 2011 at 12:17 PM, Brian O'Mearaomeara.br...@gmail.com wrote: I think considering model adequacy is something that would be useful to do and is not done much now. One general way to do this is to simulate under your chosen model and see if the real data look very different from the simulated data. For example, I might try a one rate vs. a two rate Brownian motion model and find the latter fits better. If the actual true model is an OU model with two very distinct peaks and strong selection, which is not in my model set, I'll find that my simulated data under the two rate Brownian model may look very different from my actual data, which will be fairly bimodal. Aha, my model is inadequate. [but then what -- keep adding new models, just report that your model is inadequate, ...?] Certainly, data exploration is a step that cannot be skipped; I've found that Ackerly's traitgrams work well for me for visualizing my data, although I know some people who find them simply confusing (particularly my traitgrams, as they have fossil taxa all over them). Of course, you need a method for evaluating how similar the data look. There's been some work on this in models for tree inference using posterior predictive performance (work by Jonathan Bollback and Jeremy Brown come to mind) or using other approaches (some of Peter Waddell's work), but it hasn't really taken off yet. It'd be easy to implement such approaches in R for comparative methods given capabilities in ape, Geiger, and other packages. I don't entirely follow, but I'll look in to the posterior predictive performance work you mention. But wouldn't how 'similar the data look' would all be a function of what we want to look at, would it not? On Thu, Jan 20, 2011 at 12:27 PM,tgarl...@ucr.edu wrote: One quick comment. In many cases what you can do is also fit a model with no independent variables. It, too, will have a likelihood, and can be used as a baseline to argue whether your best model is actually any good. You could then do a maximum likelihood ratio test of your best model versus your baseline model. If the baseline model does not have a significant lack of fit by a LRT (e.g., P not 0.05), then your best model arguably isn't of much use. Cheers, Ted That seems like a pretty good idea! How would we go about doing that an example case, such as for a continuous trait on a phylogeny? On Thu, Jan 20, 2011 at 2:00 PM, Nick Matzkemat...@berkeley.edu wrote: If one is interested in
Re: [R-sig-phylo] Simulate discrete characters on tree constrained tophylogenetic signal K
Dear Enrico/Emmanuel, Actually, I just realized I need continuous traits, not discrete. Sorry about that. Thanks so much for pointing me towards rTrait(using Cont). Would it be right that a one optima OU model with rTraitCont would have theta set to one single value? And a two optima model would have theta at two possible values for each branch? Etc.? Enrico, thanks very much for that thought. So I am simulating trees and traits on trees, and then calculating K using the phylosignal function. Thus, K is implicitly manipulated, instead of explicitly manipulated. Scott On Monday, January 31, 2011 at 3:28 AM, Enrico Rezende wrote: Scott, one important caveat that should be added is that, even when one knows and controls the parameters of the evolutionary model, the resulting dataset will have the expected amount of signal only ON AVERAGE, given the inherent stochasticity of these models. For instance, if you want K = 1 and assume Brownian motion character evolution on a given tree, you may still get data that have a K ranging between, say, 0.6 and 1.4 (with a mean K = 1, of course). Thus, you might want to quantify K after the simulations to ensure that you're obtaining the amount of signal desired. Ah yes, this applies primarily to quantitative traits, for qualitative characters things change a bit. Enrico Emmanuel Paradis escribió: On Fri, 28 Jan 2011 17:20:57 -0600 Scott Chamberlain myrmecocys...@gmail.com wrote: Dear R community, I would like to simulate discrete characters on a randomly generated tree. However, I would like to create different sets of trees and associated characters at certain levels of phylogenetic signal. If you know the relationship between phylogenetic signal (Blomberg et al's K in your case) and the parameters of the model used to simulate the characters, this would be easy. But I don't think this relationship is obvious, at least for discrete characters. Keep in mind also that, for a given model and parameters of evolution, the quantity of phylogenetic signal depends on the tree. Can someone point me in the right direction? I am familiar with sim.char from geiger package for simulating characters, and rcoal for simulating the ultrametric trees that I want, but I don't know how I could, if at all, simulate characters that have a particular a priori set phylogenetic signal. BTW, instead of rcoal() you may wish to use rbdtree(). And if you want more flexibility for simulating the characters, see rTraitDisc() in ape. Best, Emmanuel As a simple example, the following example gets phylogenetic signal (K) that is all over the map in different runs: q - list(rbind(c(-.5, .5), c(.5, -.5))) traits - sim.char(tree, q, model = discrete, n = 1) traits_ - as.data.frame(traits[,,1]) traitsvec - traits_[,1] names(traitsvec) - rownames(traits_) phylosignal(traitsvec, tree, reps = 100) Sincerely, Scott Chamberlain Rice University, EEB Dept. [[alternative HTML version deleted]] ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo -- *** Enrico L. Rezende Departament de Genètica i de Microbiologia Facultat de Biociències, Edifici Cn Universitat Autònoma de Barcelona 08193 Bellaterra (Barcelona) SPAIN Telephone: +34 93 581 4705 Fax: +34 93 581 2387 E-mail: enrico.reze...@uab.cat *** [[alternative HTML version deleted]] ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo