Luke Harmon wrote:
Yes, Joe is correct; there is more to this problem than meets the
eye. My implementation assumes equal probability of each unknown
state, which is quite different from modeling an actual polymorphic
character. I'm sure that doing something different might matter in
many cases.
Assuming equal probability of each possible state might be thought of
as a model of ambiguity of state, not polymorphism. But even as a
model of ambiguity it is not a complete likelihood treatment. In likelihood
machinery, one uses conditional likelihoods, which give a likelihood
of 1 to each possible state. This is not as crazy as it sounds (see
pages 255-256 of my book). It is simply that what we have in the
conditional likelihoods is NOT the probability of the state, but the
probability of the ambiguous observation given the state. So, for
example, if we see a purine but don't know whether it is A or G (in a
DNA sequence case), the probability of seeing purine, given that we
can only see purineness or pyrimidineness and the state really is A,
is 1, and similarly if it is really G. So the conditional
likelihoods for the four nucleotides (in the order A, C, G, T) are
(1,0,1,0). Sounds wrong but it isn't.
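
A minimal sketch of the point about conditional likelihoods, in Python. Everything in it is an illustrative assumption rather than code from any existing program: the function names are made up, the Jukes-Cantor (JC69) model is used only because its transition probabilities are simple, and the "tree" is just two tips joined by one branch. It shows that giving each compatible state a conditional likelihood of 1 makes the likelihood of the ambiguous observation equal to the sum of the likelihoods of its possible resolutions.

```python
import math

STATES = "ACGT"

# A few IUPAC-style codes mapped to the nucleotides they are compatible with.
AMBIGUITY = {"A": "A", "C": "C", "G": "G", "T": "T",
             "R": "AG", "Y": "CT", "N": "ACGT"}


def tip_conditional_likelihoods(code):
    """P(observation | true state) for each of A, C, G, T: 1 if compatible, else 0."""
    return [1.0 if s in AMBIGUITY[code] else 0.0 for s in STATES]


def jc69_transition(t):
    """JC69 transition matrix for branch length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def two_tip_likelihood(code1, code2, t):
    """Likelihood of the observations at two tips separated by branch length t.

    The root is placed at tip 1, which is allowed because JC69 is reversible;
    0.25 is the equilibrium frequency of each nucleotide.
    """
    l1 = tip_conditional_likelihoods(code1)
    l2 = tip_conditional_likelihoods(code2)
    p = jc69_transition(t)
    return sum(0.25 * l1[i] * p[i][j] * l2[j]
               for i in range(4) for j in range(4))
```

With these definitions, tip_conditional_likelihoods("R") is [1.0, 0.0, 1.0, 0.0], and two_tip_likelihood("R", "A", t) equals two_tip_likelihood("A", "A", t) + two_tip_likelihood("G", "A", t) for any branch length t.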
Polymorphism is totally different: you have actually seen both states.
For discrete 0/1 characters, one can use Sewall Wright's (1934)
threshold model, which I have discussed (briefly in the book and more
extensively in a 2005 paper in the Philosophical Transactions of the
Royal Society B). I have a paper under revision at a major journal
about it and will release my program Threshml soon in a pre-PHYLIP
version. Unlike Mark Pagel and Paul Lewis's Mk model, it predicts
polymorphism in a natural way. The population has an underlying
unobservable quantitative character, the "liability", that implies
some frequency of both 0 and 1 states. I think Ted Garland and
others also use a log-linear model that has somewhat similar
properties but is not exactly the same.
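
A rough sketch of how a liability can imply a frequency of both states, in Python. This is not taken from Threshml; it assumes, purely for illustration, that individual liabilities are normally distributed around the population mean with standard deviation 1, and that an individual shows state 1 when its liability exceeds a threshold of 0. The function names and default values are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def state1_frequency(mean_liability, threshold=0.0, sd=1.0):
    """Expected fraction of individuals showing state 1.

    Assumes individual liability ~ Normal(mean_liability, sd) and that
    state 1 is shown when liability is above the threshold (both assumptions).
    """
    return 1.0 - normal_cdf((threshold - mean_liability) / sd)


def state0_frequency(mean_liability, threshold=0.0, sd=1.0):
    """Expected fraction of individuals showing state 0."""
    return 1.0 - state1_frequency(mean_liability, threshold, sd)
```

Under these assumptions a population whose mean liability sits exactly on the threshold is half 0 and half 1, and a population far from the threshold is nearly fixed for one state, so polymorphism appears whenever the mean liability is near the threshold.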
Getting these models to deal with multiple character states is
possible but very, very nontrivial. If you see states 0, 1, 2, is 1
intermediate between 0 and 2, or is it off at right angles to both?
There are possible threshold models that could do either -- telling
the difference between them requires lots of data. With, say, 6
states it would be a nightmare.
Joe
----
Joe Felsenstein, j...@gs.washington.edu
Dept. of Genome Sciences, Univ. of Washington
Box 355065, Seattle, WA 98195-5065 USA
_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo