Hello All,

I have a COI FASTA alignment file with 11926 sequences spanning 668 species.

According to my code, a total of 361 species are represented by 6 or more
sequences.

I need to extract these 361 species (along with all their associated
sequences) from the alignment but am having issues.

Here is my code:

    library(ape)
    library(pegas)
    library(spider)
    library(stringr)

    seqs <- read.dna(file = file.choose(), format = "fasta") # import data
and convert to matrix
    seqs.mat <- as.matrix(seqs)

    spp <- substr(dimnames(seqs.mat)[[1]], 1, 50) # extract sequence labels

    res <- str_remove(spp, "^[^|]+\\|") # remove BOLD Process ID

    res <- table(res)[which(table(res) >= 6)] # species with 6 or more
records
    names(res) # 361 species names


The problem is that `names(res)' contains unique species names.rather than
repeated species names (according to the BOLD process ID).

I also need the sequences. There are between 6-421 sequences across the 361
species (11143 sequences total).

Does anyone have any ideas on next steps here?

Once this is done, I would then write the sequences to a file via
`write.dna()`

If clarification is needed, please let me know.


Thanks.

Cheers,

Jarrett

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Reply via email to