Thomas,

Thomas Girke wrote:
Dear Herve,

Thanks a lot for your suggestions and efforts to optimize this step!

We are often using sequences of variable length (e.g. after adaptor trimming), but this shouldn't be a problem because we can simply loop over every length interval (usually less than 100 iterations!).

Yes I could extend replaceLetterAt() so it supports DNAStringSet objects
of arbitrary shape. In that case, two formats for 'at' would be natural:
"list of logical vectors" or "list of integer vectors". Then you could do
something like this:

(1) Trim the adaptor:

      trimming <- trimLRPatterns(..., dset, ranges=TRUE)
      trimmed_dset <- narrow(dset, start=start(trimming), end=end(trimming))
      trimmed_myqual <- narrow(myqual,
                               start=start(trimming),
                               end=end(trimming))

(2) Build the 'at' object to be used in replaceLetterAt():

      flat_quals <- charToRaw(as.character(unlist(trimmed_myqual)))
      flat_at <- flat_quals < charToRaw(as.character(PhredQuality(20L)))
      at <- split(flat_at,
                  rep.int(seq_len(length(trimmed_myqual)),
                  width(trimmed_myqual)))

(3) Build the 'letter' object to be used in replaceLetterAt():

      letter_subject <-
          DNAString(paste(rep.int("N", max(width(trimmed_dset))), collapse=""))
      letter <- as(Views(letter_subject, start=1, end=sapply(at, sum)),
                   "DNAStringSet")

(4) Call replaceLetterAt():

      dset4 <- replaceLetterAt(trimmed_dset, at, letter)

Also some convenience could be added to Biostrings to support this use case
by wrapping some of the steps above into utility functions e.g.
 o (1) could be done in a single call if we had a "trimLRPatterns"
   method for QualityScaledDNAStringSet objects (which is easy
   to add),
 o (2) and (3) could be wrapped too but I don't have good names for
   the wrappers and am not sure about their arguments either. I would
   need to think a little bit about it.

What do you think? Would that work?


One short question:
Are there any plans to support masks in the future on XStringSet
objects? If one could define and apply masks to sequences of variable
length in XStringSet objects (without loops) then this would be be very
useful for many applications.

No plans to support *soft* masking of XStringSet objects in the near
future because it's a too complicated thing to put in place, at least
for now. However *hard* masking (i.e. alteration of an XStringSet object
by injection of the "-" letter at some locations) is much easier to
support because 99% of the times you don't need to touch the implementation
of the functions that operate on XStringSet objects to make them work
on a hard-masked XStringSet object. They just keep working just fine.
The downside of hard masking is that every time the hard mask needs to be
modified, you need to make a new copy of the XStringSet object. But we
first need to see use cases where this becomes really an issue before
we start working on *soft* masking of XStringSet objects.

Cheers,
H.



Many thanks again,
Thomas

On Tue, May 19, 2009 at 11:17:16AM -0700, Hervé Pagès wrote:
Hi Thomas,

Sorry for the slow response on this.

Here is a much faster solution to your problem. Depending on the
size of the 'dset' and 'myqual' rectangles, it will be 100x or 1000x
faster than the sapply-based solution. Let's assume 'dset' is the
constant width DNAStringSet object containing your reads and 'myqual'
is the constant width PhredQuality object containing the qualities
of your reads.
The following solution takes advantage of the fact that 'dset' and
therefore 'myqual' are rectangular, hence 'myqual' can be turned into
a raw matrix pretty efficiently:

  myqual_mat <- matrix(charToRaw(as.character(unlist(myqual))),
                       nrow=length(myqual), byrow=TRUE)

This raw matrix itself can be turned into a boolean matrix using some
cutoff value:

  at <- myqual_mat < charToRaw(as.character(PhredQuality(20L)))

Also I've just added a "replaceLetterAt" method for DNAStringSet objects
to Biostrings (in addition to the existing method for DNAString objects).
Currently, it has the limitation that 'x' and 'at' must be rectangular
i.e. 'x' must have a constant width and 'at' must be a logical matrix
with the same dimensions as 'x'. That's of course just what's needed to
handle your usecase:

letter_subject <- DNAString(paste(rep.int("N", width(dset)[1]), collapse="")) letter <- as(Views(letter_subject, start=1, end=rowSums(at)), "DNAStringSet")
  dset3 <- replaceLetterAt(dset, at, letter)

'dset3' is the same as the 'dset2' object you obtained with the sapply-based
solution.

This improved replaceLetterAt() function will be available in Biostrings
release (v. 2.12.3) and devel (v. 2.13.8) when they propagate to the public
repositories (in the next 24 hours).

Please let me know if you have any other concerns.

Cheers,
H.


Thomas Girke wrote:
Dear Developers,

Is there an efficient method available for replacing base calls in sequences by Ns (or any other character) at positions where the quality score drops below some arbitrary threshold. So far we have been using the following code for this purpose. In general this approach works fine. However, the R loop for applying this function makes it slow for very large data sets with millions of sequences. Is there a method available for achieving the same result more efficiently? Considering the available resources in Biostrings, I am wondering if it wouldn't make a lot of sense of using Biostrings' sequence masking method along with the injectHardMask function for this purpose. However, at this point I am only aware of methods for applying masks efficiently to a single sequence in a XString object, but not to many sequences stored in a XStringSet object.
#################
## Sample Code ##
#################
## Create example data set with random data
library(Biostrings)
dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random Phred score list. myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) myqual <- PhredQuality(myqual) dsetq1 <- QualityScaledDNAStringSet(dset, myqual) ## Define in character inject function XStringSetInject XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20, mychar="N") { mypos <- which(as.integer(myqual)<cutoff) return(toString(replaceLetterAt(myseq, which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
}
## Apply XStringSetInject function to all sequences.
dvec <- sapply(1:length(dset), function(x) XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
dset2 <- DNAStringSet(dvec)
names(dset2) <- paste("d", 1:length(dvec), sep="")
names(myqual) <- paste("d", 1:length(dvec), sep="")
dsetq2 <- QualityScaledDNAStringSet(dset2, myqual)
## Compare dsetq1 and dsetq2
dsetq1[1:2] dsetq2[1:2]

sessionInfo()
R version 2.9.0 (2009-04-17) x86_64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Biostrings_2.12.1 IRanges_1.2.0
loaded via a namespace (and not attached):
[1] Biobase_2.4.0


Thanks in advance,

Thomas

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: [email protected]
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to