Re: [R-sig-phylo] problem with boot.phylo() and sample order randomisation

Klaus Schliep Sun, 08 Jul 2012 04:54:11 -0700

Hi Alastair,

I discovered two problems with the analysis. One is your data, which
contains too many transitions and so you lost most of the information
about your the process which generated the data.
There is also a problem with prop.clades, which does not take care
enough for different orderings of the tip labels of the input trees.



Here is a small function, you will need:


checkLabels <- function(tree, tip){
    ind <- match(tip, tree$tip.label)
    tree$tip.label <- tree$tip.label[ind]
    ind2 <- match(1:length(ind), tree$edge[, 2])
    tree$edge[ind2, 2] <- order(ind)
    tree
}

I edited the text below.

Regards,
Klaus

On 7/8/12, Alastair Potts <pott...@gmail.com> wrote:
> Hi all,
>
> I was wondering whether anyone could offer advice on the following problem.
>
> When samples are closely related (say identical sequences) are
> analysed using the boot.phylo() function, the bootstrap support
> between these samples will be 100% explicitly because the sample order
> in boot.phylo() is not randomised. This means that closely related
> samples are always built with the same branching order even though the
> branch length may be 0. As the branching order is unchanged these
> branches are given high bootstrap values.

Is this really a problem? Maybe you just use di2multi to collapses the
corresponding dichotomies into a multichotomy.

>
> In previous version of the ape library, including the sample()
> function (see below) was sufficient to overcome this problem. However,
> when running some old analyses I am now getting very different
> results. Including the sample() function now leads to no bootstrap
> support for any branches.
>
> I have been playing with all combinations of the parameter settings
> that I can find for the boot.phylo() function, but can't seem to find
> a solution.
>
> I have included sample code below which I believe accurately
> represents the problem I am observing in my real datasets. Any
> suggestions or comments would be greatly appreciated.
>
> Thank you for your time,
> Alastair Potts
>
> library(ape)
> library(phangorn)
>
> windows()
> par(mfrow=c(1,2))
>
> # Create random tree, set some branch lengths to 0, and simulate DNA
> set.seed(1)
> tr <- rtree(20)
> plot(tr)
> #nodelabels()
> tr$edge.length[c(30:33,15:25)] <-0
> plot(tr)
>
> # The current figure shows the random tree before (left) and after (right)
> # shortening of branch lengths
>
> # Simulate DNA on the tree with shortened branch lengths
> set.seed(1)
> dna <- as.DNAbin(simSeq(tr))
>

# If you try
cophenetic(tr)
# you see that many distances are 4, and so you expect 4 transitions
per site between this
# sequences.
dist.dna(dna, "raw")
# contains many distances close to .75. If you sample these it is
likely you generate
# many NaNs later on in
dist.dna(dna, pairwise.deletion = T)
# or having a distance close to 750 in this case
dist.dna(dna,model="N", pairwise.deletion = T)
# For this reason:
tr$edge.length <- tr$edge.length / 5
set.seed(1)
dna <- as.DNAbin(simSeq(tr))



>
> ### Perform NJ and bootstrap without randomising the sample order
> # This incorrectly calculates high bootstrap values for short samples
> on short branches
> f <- function(x) nj(dist.dna(x,model="N", pairwise.deletion = T))
> tr1 <- f(dna)
> bb <- boot.phylo(tr1, dna, f ,B=100, trees=TRUE, rooted=T)
> tr1$node.labels <- prop.clades(tr1, bb$trees,rooted=F)
>
> plot(tr1);title("Without sample order randomisation")
> nodelabels(tr1$node.label)
>
> # Note that samples on incredibly short branch lengths have very high
> # bootstrap values
>
>
> ### Perform NJ and bootstrap WITH randomising the sample order
> # This used to work, but now does something weird. Using sample() to
> randomise the
> # input order. See below.
> f2 <- function(x) nj(dist.dna(x[sample(1:nrow(x)),],model="N",
> pairwise.deletion = T))
> tr2 <- f2(dna)
> bb2 <- boot.phylo(tr2, dna, f2 ,B=100, trees=TRUE, rooted=T)
# > tr2$node.labels <- prop.clades(tr2, bb2$trees,rooted=T)
# Here comes the fix
btr2 <- .compressTipLabel(bb2$trees)
tr2 <- checkLabels(tr2, attr(btr2, "TipLabel"))
btr2 <- .uncompressTipLabel(btr2)

tr2$node.labels <- prop.clades(tr2, btr2, rooted=F)

>
> plot(tr2);title("With sample order randomisation")
> nodelabels(tr2$node.label)
>
# plot is now ok!! Always 100% for bifurcations and lower for multifurcations!


>
>
> --
> Dr. Alastair J. Potts
> Claude Leon Postdoctoral Fellow
> Botany Department
> Nelson Mandela Metropolitan University
>
> Cell #: 082 491-7275
> Office #: 041 504-2398 (Mon, Wed, Fri)
>
> _______________________________________________
> R-sig-phylo mailing list
> R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo

Re: [R-sig-phylo] problem with boot.phylo() and sample order randomisation

Reply via email to