Dear Developers,
Is there an efficient method available for replacing base calls in
sequences by Ns (or any other character) at positions where the quality
score drops below some arbitrary threshold. So far we have been using
the following code for this purpose. In general this approach works fine.
However, the R loop for applying this function makes it slow for very large
data sets with millions of sequences.
Is there a method available for achieving the same result more efficiently?
Considering the available resources in Biostrings, I am wondering if it
wouldn't
make a lot of sense of using Biostrings' sequence masking method along with the
injectHardMask function for this purpose. However, at this point I am only
aware
of methods for applying masks efficiently to a single sequence in a XString
object,
but not to many sequences stored in a XStringSet object.
#################
## Sample Code ##
#################
## Create example data set with random data
library(Biostrings)
dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"),
20, replace=T), collapse="")))
myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create
random Phred score list.
myqual <- sapply(myqlist, function(x) toString(PhredQuality(x)))
myqual <- PhredQuality(myqual)
dsetq1 <- QualityScaledDNAStringSet(dset, myqual)
## Define in character inject function XStringSetInject
XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20,
mychar="N") {
mypos <- which(as.integer(myqual)<cutoff)
return(toString(replaceLetterAt(myseq,
which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
}
## Apply XStringSetInject function to all sequences.
dvec <- sapply(1:length(dset), function(x) XStringSetInject(myseq=dset[[x]],
myqual=myqual[x], cutoff=20))
dset2 <- DNAStringSet(dvec)
names(dset2) <- paste("d", 1:length(dvec), sep="")
names(myqual) <- paste("d", 1:length(dvec), sep="")
dsetq2 <- QualityScaledDNAStringSet(dset2, myqual)
## Compare dsetq1 and dsetq2
dsetq1[1:2]
dsetq2[1:2]
> sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.12.1 IRanges_1.2.0
loaded via a namespace (and not attached):
[1] Biobase_2.4.0
Thanks in advance,
Thomas
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing