Do you know of a more efficient way to collapse a list of DNAStringSet objects
into a single DNAStringSet? I'm trying to parse an annotated assembly by
grabbing the longest contig that hits each swissprot gene, where the gene id is
in the name of each sequence.
The way I found that works, but is very slow is to convert them to a list of
character strings, and then back to a DNAStringSet:
longestContigs <- DNAStringSet(sapply(longestContigs, as.character))
Here's the full example:
library(Biostrings)
contigsWithHits <- read.DNAStringSet("transcripts.fa")
# extract only swissprot gene names:
geneNames <- gsub("^Locus_\\d+_Transcript_\\d+/\\d+_Confidence_[0-9.]+_(.+)$",
"\\1", names(contigsWithHits), perl=TRUE)
# keep longest from each annotation gene group:
getLongest <- function(contigList){
contigWidth <- width(contigList)
return(contigList[which.max(contigWidth)])
}
# apply getLongest to each group:
longestContigs <- tapply(contigsWithHits, geneNames, getLongest)
contigNames <- sapply(longestContigs, names)
# collapse list of DNAStringSet objects back into a single DNAStringSet
longestContigs <- DNAStringSet(sapply(longestContigs, as.character))
# reapply names:
names(longestContigs) <- contigNames
Sincerely,
Tyler William H Backman
Cheminformatics Programmer
Department of Botany and Plant Sciences
E-mail: [email protected]
1207E Genomics Building
University of California
Riverside, CA 92521
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing