Re: [R-sig-phylo] HKY GTR distances

```Dear Andreas,
you could play around with dist.ml() in phangorn. It does ML distances
optimisation and you can supply a rate matrix and base frequencies for all
the distances. However this is of course slower than the closed form
solutions, but it can handle ambiguous states nicely. I can make a little
change to export (log-)likelihood values and one could write some code to
iterative optimize edge length and transition parameters.
You might even use a modelTest like approach and estimate the model
parameter on a tree (the tree does not need to be perfect) this may be
scaling better than pairwise distances if you have many sequences.
Klaus```
```

On Fri, Feb 3, 2017 at 3:52 PM, Joe Felsenstein <j...@gs.washington.edu>
wrote:

> Emmanuel wrote:
>
> >
> > There is no distance formula for HKY or GTR model. For GTR, Rodrı́guez et
> > al. developed a procedure to calculate a distance (also in Yang's 2006
> > book). An example is given below with the woodmouse:
> >
> > matlog <- function(x) {
> >     tmp <- eigen(X)
> >     V <- tmp\$vectors
> >     U <- diag(log(tmp\$values))
> >     V %*% U %*% solve(V)
> > }
> >
> > tr <- function(x) sum(diag(x))
> >
> > data(woodmouse)
> >
> > PI <- diag(base.freq(woodmouse[1:2, ]))
> > Ft <- Ftab(woodmouse[1:2, ])
> > F <- Ft/sum(Ft)
> > X <- solve(PI) %*% F
> > -tr(PI %*% matlog(X))
> >
> > You have to call this code for each pair of sequences (or wrap it in a
> > function).
> >
> > For HKY, Yang mentioned a procedure Rzhetsky & Nei (1994, J Mol Evol).
> >
>
> Strictly speaking there *is* a formula for GTR, which I think may be
> equivalent to the above.  The formula is given in my book "Inferring
> Phylogenies" on page 209 as equation 13.24.  Note that for both of
> these, the equilibrium frequencies of the bases are estimated from the
> empirical frequencies in the two sequences.  That means that for each
> distance, the equilibrium frequencies are somewhat different, as
> different sequences are being used to estimate the base frequencies.
> We are not, in the use of these formulas, estimating one overall set
> of equilibrium frequencies and then using that for all the the
> distances in our data set.
>
> The Rzhetsky-Nei 1994 distance formula is, I believe, an
> approximation, though probably a very good one.  It is not quite the
> same as the estimate you would get by maximizing the likelihood.
>
> Distance formulas that involve empirically estimated base frequencies,
> which all of these do, have in common the problem that one either
> estimates those separately for each pair of sequences, or jointly
> estimates them from the whole dataset, without using a tree in the
> process.
>
> J.F.
> ----
> Joe Felsenstein         j...@gs.washington.edu
>  Department of Genome Sciences and Department of Biology,
>  University of Washington, Box 355065, Seattle, WA 98195-5065 USA
>
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-
> sig-ph...@r-project.org/
>

--
Klaus Schliep
Postdoctoral Fellow
Revell Lab, University of Massachusetts Boston
http://www.phangorn.org/

[[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/```