[Bioc-sig-seq] Fast inject into XStringSet objects

Thomas Girke Mon, 04 May 2009 16:50:22 -0700

Dear Developers,

Is there an efficient method available for replacing base calls in 
sequences by Ns (or any other character) at positions where the quality 
score drops below some arbitrary threshold. So far we have been using 
the following code for this purpose. In general this approach works fine. 
However, the R loop for applying this function makes it slow for very large 
data sets with millions of sequences.


Is there a method available for achieving the same result more efficiently? 

Considering the available resources in Biostrings, I am wondering if it 
wouldn't 
make a lot of sense of using Biostrings' sequence masking method along with the 
injectHardMask function for this purpose. However, at this point I am only 
aware 
of methods for applying masks efficiently to a single sequence in a XString 
object, 
but not to many sequences stored in a XStringSet object. 


#################
## Sample Code ##
#################
## Create example data set with random data
library(Biostrings)
dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 
20, replace=T), collapse=""))) 
myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create 
random Phred score list.
myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) 
myqual <- PhredQuality(myqual) 
dsetq1 <- QualityScaledDNAStringSet(dset, myqual) 

## Define in character inject function XStringSetInject     
XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20, 
mychar="N") { 
        mypos <- which(as.integer(myqual)<cutoff)
        return(toString(replaceLetterAt(myseq, 
which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
}
## Apply XStringSetInject function to all sequences.
dvec <- sapply(1:length(dset), function(x) XStringSetInject(myseq=dset[[x]], 
myqual=myqual[x], cutoff=20))
dset2 <- DNAStringSet(dvec)
names(dset2) <- paste("d", 1:length(dvec), sep="")
names(myqual) <- paste("d", 1:length(dvec), sep="")
dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) 

## Compare dsetq1 and dsetq2
dsetq1[1:2] 
dsetq2[1:2] 


> sessionInfo()
R version 2.9.0 (2009-04-17) 
x86_64-unknown-linux-gnu 

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.12.1 IRanges_1.2.0    

loaded via a namespace (and not attached):
[1] Biobase_2.4.0


Thanks in advance,

Thomas

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

[Bioc-sig-seq] Fast inject into XStringSet objects

Reply via email to