Re: [R-sig-phylo] HKY GTR distances

2017-02-07 Thread Klaus Schliep
Dear Andreas,
you could play around with dist.ml() in phangorn. It does ML distances
optimisation and you can supply a rate matrix and base frequencies for all
the distances. However this is of course slower than the closed form
solutions, but it can handle ambiguous states nicely. I can make a little
change to export (log-)likelihood values and one could write some code to
iterative optimize edge length and transition parameters.
You might even use a modelTest like approach and estimate the model
parameter on a tree (the tree does not need to be perfect) this may be
scaling better than pairwise distances if you have many sequences.
Klaus


On Fri, Feb 3, 2017 at 3:52 PM, Joe Felsenstein 
wrote:

> Emmanuel wrote:
>
> >
> > There is no distance formula for HKY or GTR model. For GTR, Rodrı́guez et
> > al. developed a procedure to calculate a distance (also in Yang's 2006
> > book). An example is given below with the woodmouse:
> >
> > matlog <- function(x) {
> > tmp <- eigen(X)
> > V <- tmp$vectors
> > U <- diag(log(tmp$values))
> > V %*% U %*% solve(V)
> > }
> >
> > tr <- function(x) sum(diag(x))
> >
> > data(woodmouse)
> >
> > PI <- diag(base.freq(woodmouse[1:2, ]))
> > Ft <- Ftab(woodmouse[1:2, ])
> > F <- Ft/sum(Ft)
> > X <- solve(PI) %*% F
> > -tr(PI %*% matlog(X))
> >
> > You have to call this code for each pair of sequences (or wrap it in a
> > function).
> >
> > For HKY, Yang mentioned a procedure Rzhetsky & Nei (1994, J Mol Evol).
> >
>
> Strictly speaking there *is* a formula for GTR, which I think may be
> equivalent to the above.  The formula is given in my book "Inferring
> Phylogenies" on page 209 as equation 13.24.  Note that for both of
> these, the equilibrium frequencies of the bases are estimated from the
> empirical frequencies in the two sequences.  That means that for each
> distance, the equilibrium frequencies are somewhat different, as
> different sequences are being used to estimate the base frequencies.
> We are not, in the use of these formulas, estimating one overall set
> of equilibrium frequencies and then using that for all the the
> distances in our data set.
>
> The Rzhetsky-Nei 1994 distance formula is, I believe, an
> approximation, though probably a very good one.  It is not quite the
> same as the estimate you would get by maximizing the likelihood.
>
> Distance formulas that involve empirically estimated base frequencies,
> which all of these do, have in common the problem that one either
> estimates those separately for each pair of sequences, or jointly
> estimates them from the whole dataset, without using a tree in the
> process.
>
> J.F.
> 
> Joe Felsenstein j...@gs.washington.edu
>  Department of Genome Sciences and Department of Biology,
>  University of Washington, Box 355065, Seattle, WA 98195-5065 USA
>
> ___
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-
> sig-ph...@r-project.org/
>



-- 
Klaus Schliep
Postdoctoral Fellow
Revell Lab, University of Massachusetts Boston
http://www.phangorn.org/

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] HKY GTR distances

2017-02-03 Thread Joe Felsenstein
Emmanuel wrote:

>
> There is no distance formula for HKY or GTR model. For GTR, Rodrı́guez et
> al. developed a procedure to calculate a distance (also in Yang's 2006
> book). An example is given below with the woodmouse:
>
> matlog <- function(x) {
> tmp <- eigen(X)
> V <- tmp$vectors
> U <- diag(log(tmp$values))
> V %*% U %*% solve(V)
> }
>
> tr <- function(x) sum(diag(x))
>
> data(woodmouse)
>
> PI <- diag(base.freq(woodmouse[1:2, ]))
> Ft <- Ftab(woodmouse[1:2, ])
> F <- Ft/sum(Ft)
> X <- solve(PI) %*% F
> -tr(PI %*% matlog(X))
>
> You have to call this code for each pair of sequences (or wrap it in a
> function).
>
> For HKY, Yang mentioned a procedure Rzhetsky & Nei (1994, J Mol Evol).
>

Strictly speaking there *is* a formula for GTR, which I think may be
equivalent to the above.  The formula is given in my book "Inferring
Phylogenies" on page 209 as equation 13.24.  Note that for both of
these, the equilibrium frequencies of the bases are estimated from the
empirical frequencies in the two sequences.  That means that for each
distance, the equilibrium frequencies are somewhat different, as
different sequences are being used to estimate the base frequencies.
We are not, in the use of these formulas, estimating one overall set
of equilibrium frequencies and then using that for all the the
distances in our data set.

The Rzhetsky-Nei 1994 distance formula is, I believe, an
approximation, though probably a very good one.  It is not quite the
same as the estimate you would get by maximizing the likelihood.

Distance formulas that involve empirically estimated base frequencies,
which all of these do, have in common the problem that one either
estimates those separately for each pair of sequences, or jointly
estimates them from the whole dataset, without using a tree in the
process.

J.F.

Joe Felsenstein j...@gs.washington.edu
 Department of Genome Sciences and Department of Biology,
 University of Washington, Box 355065, Seattle, WA 98195-5065 USA

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] HKY GTR distances

2017-02-03 Thread Emmanuel Paradis

Dear Andreas,

There is no distance formula for HKY or GTR model. For GTR, Rodrı́guez 
et al. developed a procedure to calculate a distance (also in Yang's 
2006 book). An example is given below with the woodmouse:


matlog <- function(x) {
tmp <- eigen(X)
V <- tmp$vectors
U <- diag(log(tmp$values))
V %*% U %*% solve(V)
}

tr <- function(x) sum(diag(x))

data(woodmouse)

PI <- diag(base.freq(woodmouse[1:2, ]))
Ft <- Ftab(woodmouse[1:2, ])
F <- Ft/sum(Ft)
X <- solve(PI) %*% F
-tr(PI %*% matlog(X))

You have to call this code for each pair of sequences (or wrap it in a 
function).


For HKY, Yang mentioned a procedure Rzhetsky & Nei (1994, J Mol Evol).

Best,

Emmanuel

Rodrı́guez F., Oliver J. L., Marı́n A. & Medina J. R. 1990. The general 
stochastic model of nucleotide substitution. Journal of Theoretical

Biology 142: 485–501.


Le 01/02/2017 à 23:52, kolte...@rub.de a écrit :

Dear Phylothusiasts,
I need to compare multiple substitution models side-by-side (species
clustering stuff by distances only). Unfortunately, I am not aware of an
implementation of HKY and GTR distance computations using R. Maybe there
is some Github code or something else that I have been missing?
I do not want to build a phylogenetic tree. I can not use PAUP (>500 big
fasta files).
Any ideas are greatly appreciated.
Best wishes,
Andreas

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
http://www.mail-archive.com/r-sig-phylo@r-project.org/







___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] HKY GTR distances

2017-02-02 Thread Jacob Berv
Can you do this in MEGA?
Jake

> On Feb 1, 2017, at 5:52 PM, kolte...@rub.de  
> wrote:
> 
> Dear Phylothusiasts,
> I need to compare multiple substitution models side-by-side (species 
> clustering stuff by distances only). Unfortunately, I am not aware of an 
> implementation of HKY and GTR distance computations using R. Maybe there is 
> some Github code or something else that I have been missing?
> I do not want to build a phylogenetic tree. I can not use PAUP (>500 big 
> fasta files).
> Any ideas are greatly appreciated.
> Best wishes,
> Andreas
> 
> ___
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/