Dear Andreas, you could play around with dist.ml() in phangorn. It does ML distances optimisation and you can supply a rate matrix and base frequencies for all the distances. However this is of course slower than the closed form solutions, but it can handle ambiguous states nicely. I can make a little change to export (log-)likelihood values and one could write some code to iterative optimize edge length and transition parameters. You might even use a modelTest like approach and estimate the model parameter on a tree (the tree does not need to be perfect) this may be scaling better than pairwise distances if you have many sequences. Klaus

## Advertising

On Fri, Feb 3, 2017 at 3:52 PM, Joe Felsenstein <j...@gs.washington.edu> wrote: > Emmanuel wrote: > > > > > There is no distance formula for HKY or GTR model. For GTR, Rodrı́guez et > > al. developed a procedure to calculate a distance (also in Yang's 2006 > > book). An example is given below with the woodmouse: > > > > matlog <- function(x) { > > tmp <- eigen(X) > > V <- tmp$vectors > > U <- diag(log(tmp$values)) > > V %*% U %*% solve(V) > > } > > > > tr <- function(x) sum(diag(x)) > > > > data(woodmouse) > > > > PI <- diag(base.freq(woodmouse[1:2, ])) > > Ft <- Ftab(woodmouse[1:2, ]) > > F <- Ft/sum(Ft) > > X <- solve(PI) %*% F > > -tr(PI %*% matlog(X)) > > > > You have to call this code for each pair of sequences (or wrap it in a > > function). > > > > For HKY, Yang mentioned a procedure Rzhetsky & Nei (1994, J Mol Evol). > > > > Strictly speaking there *is* a formula for GTR, which I think may be > equivalent to the above. The formula is given in my book "Inferring > Phylogenies" on page 209 as equation 13.24. Note that for both of > these, the equilibrium frequencies of the bases are estimated from the > empirical frequencies in the two sequences. That means that for each > distance, the equilibrium frequencies are somewhat different, as > different sequences are being used to estimate the base frequencies. > We are not, in the use of these formulas, estimating one overall set > of equilibrium frequencies and then using that for all the the > distances in our data set. > > The Rzhetsky-Nei 1994 distance formula is, I believe, an > approximation, though probably a very good one. It is not quite the > same as the estimate you would get by maximizing the likelihood. > > Distance formulas that involve empirically estimated base frequencies, > which all of these do, have in common the problem that one either > estimates those separately for each pair of sequences, or jointly > estimates them from the whole dataset, without using a tree in the > process. > > J.F. > ---- > Joe Felsenstein j...@gs.washington.edu > Department of Genome Sciences and Department of Biology, > University of Washington, Box 355065, Seattle, WA 98195-5065 USA > > _______________________________________________ > R-sig-phylo mailing list - R-sig-phylo@r-project.org > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > Searchable archive at http://www.mail-archive.com/r- > sig-ph...@r-project.org/ > -- Klaus Schliep Postdoctoral Fellow Revell Lab, University of Massachusetts Boston http://www.phangorn.org/ [[alternative HTML version deleted]] _______________________________________________ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/