Do you know of a more efficient way to collapse a list of DNAStringSet objects 
into a single DNAStringSet? I'm trying to filter an annotated assembly by 
keeping the longest contig that hits each Swiss-Prot gene, where the gene ID is 
embedded in the name of each sequence.

The approach I found works, but it is very slow: convert each element of the 
list to a character string, then build a DNAStringSet from the result:

longestContigs <- DNAStringSet(sapply(longestContigs, as.character))

Here's the full example:

library(Biostrings)
# readDNAStringSet() replaces read.DNAStringSet() in current Biostrings releases
contigsWithHits <- readDNAStringSet("transcripts.fa")

# extract only swissprot gene names:
geneNames <- gsub("^Locus_\\d+_Transcript_\\d+/\\d+_Confidence_[0-9.]+_(.+)$", 
"\\1", names(contigsWithHits), perl=TRUE)

# keep longest from each annotation gene group:

getLongest <- function(contigList){
    contigWidth <- width(contigList)
    return(contigList[which.max(contigWidth)])
}

# apply getLongest to each group:
longestContigs <- tapply(contigsWithHits, geneNames, getLongest)
contigNames <- sapply(longestContigs, names)
# collapse list of DNAStringSet objects back into a single DNAStringSet
longestContigs <- DNAStringSet(sapply(longestContigs, as.character))
# reapply names:
names(longestContigs) <- contigNames
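
One possible way to avoid building the list at all is to pick the index of the longest contig per gene and subset once. The following is only a minimal sketch under stated assumptions, not a tested solution for the real data: plain character strings stand in for the sequences, nchar() stands in for width(), and the toy contig and gene names are made up for illustration.

```r
# Toy stand-ins (assumed data): plain strings play the role of contigs,
# and nchar() plays the role of width() on a DNAStringSet.
contigs <- c(a1 = "ACGT", a2 = "ACGTACGT", b1 = "AC", b2 = "A", c1 = "ACG")
geneNames <- c("geneA", "geneA", "geneB", "geneB", "geneC")
contigWidth <- nchar(contigs)

# For each gene, find the position (in the original vector) of its
# longest contig; ties resolve to the first one, as with which.max().
keep <- as.integer(tapply(seq_along(contigs), geneNames,
                          function(i) i[which.max(contigWidth[i])]))

# A single subsetting step keeps the original names, so there is no
# list to collapse and no names to reapply.
longestContigs <- contigs[keep]
```

With Biostrings loaded, the same indices should subset the set directly, for example contigsWithHits[keep] with contigWidth <- width(contigsWithHits), since a DNAStringSet supports integer subsetting and keeps its names.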

Sincerely,
Tyler William H Backman
Cheminformatics Programmer
Department of Botany and Plant Sciences
E-mail: [email protected]
1207E Genomics Building
University of California
Riverside, CA 92521

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
