Re: [R-sig-phylo] Extracting sequences from DNAbin according to a condition

Emmanuel Paradis Sun, 27 Jun 2021 20:57:07 -0700

Hi Jarrett,

----- Le 28 Juin 21, à 5:08, Jarrett Phillips phillipsjarre...@gmail.com a 
écrit :
> Hello All,
> 
> I have a COI FASTA alignment file with 11926 sequences spanning 668 species.
> 
> According to my code, a total of 361 species are represented by 6 or more
> sequences.
> 
> I need to extract these 361 species (along with all their associated
> sequences) from the alignment but am having issues.
> 
> Here is my code:
> 
>    library(ape)
>    library(pegas)
>    library(spider)
>    library(stringr)
> 
>    seqs <- read.dna(file = file.choose(), format = "fasta") # import data
> and convert to matrix
>    seqs.mat <- as.matrix(seqs)
> 
>    spp <- substr(dimnames(seqs.mat)[[1]], 1, 50) # extract sequence labels
> 
>    res <- str_remove(spp, "^[^|]+\\|") # remove BOLD Process ID


I suggest you store the output of table() in a distinct object so you keep 
'res':

tab.res <- tab(res)

Besides, table() can be a bit long so it's better to avoid duplicating its 
call. Then you can do:

sel <- tab.res >= 6
species.to.keep <- names(tab.res)[sel] # which() is not necessary here
length(species.to.keep) # should be 361
i <- which(res %in% species.to.keep)
write.dna(seqs[i, ], .....

This way, you can change your selection criterion easily. Suppose you want to 
select the species with at least 5 sequences but no more than 10, then you'd 
only need to change the first line:

sel <- tab.res >= 5 & tab.res <= 10

HTH.

Cheers,

Emmanuel

>    res <- table(res)[which(table(res) >= 6)] # species with 6 or more
> records
>    names(res) # 361 species names
> 
> 
> The problem is that `names(res)' contains unique species names.rather than
> repeated species names (according to the BOLD process ID).
> 
> I also need the sequences. There are between 6-421 sequences across the 361
> species (11143 sequences total).
> 
> Does anyone have any ideas on next steps here?
> 
> Once this is done, I would then write the sequences to a file via
> `write.dna()`
> 
> If clarification is needed, please let me know.
> 
> 
> Thanks.
> 
> Cheers,
> 
> Jarrett
> 
>       [[alternative HTML version deleted]]
> 
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Extracting sequences from DNAbin according to a condition

Reply via email to