Re: [R-sig-phylo] Extracting sequences from DNAbin according to a condition

2021-06-27 Thread Emmanuel Paradis
Hi Jarrett,

- Le 28 Juin 21, à 5:08, Jarrett Phillips phillipsjarre...@gmail.com a 
écrit :
> Hello All,
> 
> I have a COI FASTA alignment file with 11926 sequences spanning 668 species.
> 
> According to my code, a total of 361 species are represented by 6 or more
> sequences.
> 
> I need to extract these 361 species (along with all their associated
> sequences) from the alignment but am having issues.
> 
> Here is my code:
> 
>library(ape)
>library(pegas)
>library(spider)
>library(stringr)
> 
>seqs <- read.dna(file = file.choose(), format = "fasta") # import data
> and convert to matrix
>seqs.mat <- as.matrix(seqs)
> 
>spp <- substr(dimnames(seqs.mat)[[1]], 1, 50) # extract sequence labels
> 
>res <- str_remove(spp, "^[^|]+\\|") # remove BOLD Process ID

I suggest you store the output of table() in a distinct object so you keep 
'res':

tab.res <- tab(res)

Besides, table() can be a bit long so it's better to avoid duplicating its 
call. Then you can do:

sel <- tab.res >= 6
species.to.keep <- names(tab.res)[sel] # which() is not necessary here
length(species.to.keep) # should be 361
i <- which(res %in% species.to.keep)
write.dna(seqs[i, ], .

This way, you can change your selection criterion easily. Suppose you want to 
select the species with at least 5 sequences but no more than 10, then you'd 
only need to change the first line:

sel <- tab.res >= 5 & tab.res <= 10

HTH.

Cheers,

Emmanuel

>res <- table(res)[which(table(res) >= 6)] # species with 6 or more
> records
>names(res) # 361 species names
> 
> 
> The problem is that `names(res)' contains unique species names.rather than
> repeated species names (according to the BOLD process ID).
> 
> I also need the sequences. There are between 6-421 sequences across the 361
> species (11143 sequences total).
> 
> Does anyone have any ideas on next steps here?
> 
> Once this is done, I would then write the sequences to a file via
> `write.dna()`
> 
> If clarification is needed, please let me know.
> 
> 
> Thanks.
> 
> Cheers,
> 
> Jarrett
> 
>   [[alternative HTML version deleted]]
> 
> ___
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


Re: [R-sig-phylo] Using bind.tip and lapply on a multiPhylo object

2021-06-27 Thread Russell Engelman
Dear Everyone,

Sorry for the late reply. I had a crunch with several manuscripts that had
a looming due date, and am coming back to trying to fix the issue I am
having with the phylogeny I emailed about before. I've been trying
several things but I've still been unable to resolve the issue with the
error message

"Error in bind.tree(tree, tip, where = where, position = pp) :
  'position' is larger than the branch length"

despite the fact that the length of the terminal sister taxon to its
nearest node is much greater than the new branch length (i.e., there should
be plenty of length to attach a taxon). In many of the cases where I am
getting an error I am attaching a sister taxon to a terminal taxon, so it
doesn't seem like a lack of monophyly across the trees is the problem.

The only thing I could think of is trying to manually adjust the next node
up to give it enough room to attach the new terminal taxon, but I am unsure
what the easiest way to do that in R would be. I've never tried to mess
with branch lengths in a tree before.

With regards to what Dr. Miller said, I agree that birth-death trees seem
to suggest much younger divergence times than other studies, however I am
more struggling with the practical side of things. What I am trying to do
with this dataset is include enough taxa to perform a PGLS analysis, and
all I need is a phylogenetic tree to use as a backbone. The Upham et al.
2019 trees are about the only easily accessible trees I have been able to
find for Mammalia, I know there was an older one that Dr. Upham suggested
that was published in 2014 but I don't recall if that one was easily
available.

Sincerely,
Russell


On Fri, Jun 4, 2021 at 3:27 PM Eliot Miller  wrote:

> Russell,
>
> If you're trying to do this across a lot of species, you're going to bump
> into a lot of complexities like this. I got the impression you were trying
> to do this for many species. For example, there's no reason that the taxon
> you're adding is, even if we assume monophyly of the clade (genus in this
> example) in question, nested somewhere in the crown clade. It might be
> sister to the clade. So you should theoretically be considering that node
> as a possible place to add the taxon. Adding the taxon in a polytomy at the
> MRCA and then randomly resolving that should do that (I assume, though do
> those operations also consider moving stemwards?). You can also randomly
> choose a random valid node that doesn't render the clade polyphyletic,
> split on that node and use a birth-death model, or any number of other
> rules (midpoint, polytomy, uniform distribution) to decide how to divvy up
> branch lengths and keep the tree ultrametric. This is what's happening with
> addTaxa. Also, if you hit a situation where you're trying to add a taxon to
> a set of taxa that are not actually monophyletic, you'll get weird results
> with your current approach. If that's not your situation, no problems
> there, but you ought to check (or just use addTaxa which will make sure it
> doesn't make the clade any "more" polyphyletic, whatever that means). FWIW,
> I'm leery of using birth-death models to add lots of missing species
> because I think it biases the resulting tree towards increasing
> diversification in the present. At least it does with addTaxa. This might
> be a programming error on my part, but I actually suspect it's a systematic
> bias introduced by problems described in recent papers by Stilianos Louca
> and Matthew Pennell, e.g.
> https://www.sciencedirect.com/science/article/abs/pii/S0960982221006138?dgcid=coauthor
> Figuring out if incorporating new birth-death estimators (that don't cap
> extinction at 0) could resolve some of that bias was on my to-do list for
> addTaxa, and is why I've never gotten around to more formally publishing
> tests of it. Also, I'm fairly sure Jonathan Chang et al. wrote a relevant R
> package here which also automates most of this, and used it for a big fish
> phylogeny. That'd be worth a look if you're trying to do this across
> hundreds or thousands of species.
>
> Eliot
>
> On Fri, Jun 4, 2021 at 2:59 PM Russell Engelman 
> wrote:
>
>> Dear Everyone,
>>
>> So this is really strange. I tried what Dr. Revell suggested, and for the
>> two examples I gave it worked for *Cercopithecus albogularis*, but when
>> I applied it to *Cricetomys ansorgei *I got the message...
>>
>> "Error in bind.tree(tree, tip, where = where, position = pp) :
>>   'position' is larger than the branch length"
>>
>> I know that usually this means the new node position is usually further
>> back in time than the point where the branch diverges from its nearest
>> neighbor. However when I went to check how long the terminal branch leading
>> to *C. ansorgei*'s sister taxon that I am trying to bind it to is, *C.
>> emini*, it said that the branch length to this taxon is much, much
>> greater than the branch length of the new taxon I am trying to set (3.28
>> Ma), so it is not clear why it is 

[R-sig-phylo] Extracting sequences from DNAbin according to a condition

2021-06-27 Thread Jarrett Phillips
Hello All,

I have a COI FASTA alignment file with 11926 sequences spanning 668 species.

According to my code, a total of 361 species are represented by 6 or more
sequences.

I need to extract these 361 species (along with all their associated
sequences) from the alignment but am having issues.

Here is my code:

library(ape)
library(pegas)
library(spider)
library(stringr)

seqs <- read.dna(file = file.choose(), format = "fasta") # import data
and convert to matrix
seqs.mat <- as.matrix(seqs)

spp <- substr(dimnames(seqs.mat)[[1]], 1, 50) # extract sequence labels

res <- str_remove(spp, "^[^|]+\\|") # remove BOLD Process ID

res <- table(res)[which(table(res) >= 6)] # species with 6 or more
records
names(res) # 361 species names


The problem is that `names(res)' contains unique species names.rather than
repeated species names (according to the BOLD process ID).

I also need the sequences. There are between 6-421 sequences across the 361
species (11143 sequences total).

Does anyone have any ideas on next steps here?

Once this is done, I would then write the sequences to a file via
`write.dna()`

If clarification is needed, please let me know.


Thanks.

Cheers,

Jarrett

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/