On 06/11/2010 12:48 PM, Marcus Davy wrote: > Hi, > there are a couple of ways to have a go at this, for example pattern match > for B's and count the matches using low level matching or coerce to numeric > quality scores and summarize those. I have provided an example where I > subset the ShortReadQ object to work with the 3' end of interest. However, > if you do this, qa() does not take into account that you may have changed > the starting cycle position to something other than 1. It would be quite > nice to have an additional argument in qa() which sets an alternative start > cycle, for example if you ran a quality report on the last 20 cycles of > interest when there are 100 cycles, your qa() summary output should start at > cycle 81 to 100.
One thing is that the subset might often and reasonably be ragged, e.g., 'the last 20 nucleotides of a 454 run', where the reads are different widths. > > Another issue, is that ShortRead:::.plotCycleQuality only plots the means > for each quality score versus cycle position. This does not provide any > information about the variability in those quality scores for each cycle > which usually increases as the cycle number increases. It's hard to know how to do this when the number of cycles gets large, without just writing a lot of ink to the plot. Maybe every 5th / 10th cycle might have quantile information? (with the qa object containing the info for all cycles). > I have investigated using a range of quantile scores to provide additional > distributional information, however this did not visualize particularly > well. Boxplots for each cycle position seem to be a better way to summarize > quality data. Examples of these are created using the fastx toolkit ( > http://hannonlab.cshl.edu/fastx_toolkit/). > > It would be nice to have function that can produce plotCycleQuality > Boxplots however this does not appear to be straightforward from summary > information derived from qa(). I would be interested if anyone has had a go > at doing this, I was thinking of an approach taking relevant quality > information out of qa() and making use of Rle objects to obtain summary > quantile statistics and then make boxplots from first principles using a > lattice panel function based on lattice::panel.bwplot. > > > library(ShortRead) > exptPath <- system.file("extdata", package = "ShortRead") > sp <- SolexaPath(exptPath) > fqpattern <- "s_1_sequence.txt" > fl <- file.path(analysisPath(sp), fqpattern) > fq <- readFastq(sp, fqpattern) > > > w <- width(fq[1]) > l <- 5 > ## Subset of fq for the last 5 cycles > fqs <- new("ShortReadQ", sread = DNAStringSet(sread(fq), (w-l+1), NA), > quality = SFastqQuality(BStringSet(quality(quality(fq)), > (w-l+1), NA)), id = id(fq)) Better to call the constructor ShortReadQ(sread=...) than 'new' directly. Better still to narrow(fq, w-l+1, NA) (here to narrow an object with variable width reads one might narrow(fq, width(fq)-l+1, NA) > > ## Sanity check > quality(quality(fqs)) > > ## Low level matching for pattern B > pattern <- "B" > for(i in 1:l) { > cat("[ searching for", pattern, "at position", i, "using hasLetterAt ]\n") > print(sum(hasLetterAt(quality(quality(fqs)), "X", i))) > cat("[ searching for", pattern, "at position", i, "using isMatchingAt > ]\n") > print(sum(isMatchingAt("X", quality(quality(fqs)), i))) > } > > ## Extracting scores matrix of quality scores > scores <- as(SFastqQuality(quality(quality(fq))), "matrix")[, (w-l+1):w] already an SFastqQuality, so just as(quality(fq), "matrix"). Thanks for the ideas. Martin > summary(scores) > > ## Run a qa report on a subset of fq > qaSummary <- qa(fqs, lane="A") > reportName <- report(qaSummary) > browseURL(reportName) > > ## perCycle quality plot -note start cycle incorrect > perCycle <- qaSummary[["perCycle"]] > ShortRead:::.plotCycleQuality(perCycle$quality) > > > > Marcus > > > On Fri, Jun 4, 2010 at 8:04 AM, Steve Lianoglou < > [email protected]> wrote: > >> Hi, >> >> On Thu, Jun 3, 2010 at 3:39 PM, Pratap, Abhishek >> <[email protected]> wrote: >>> Hi All >>> >>> I would like to extract and count the last 5 quality values from the >> FASTQ file. I have read the file using "readFastq" and have stored the >> quality values as a BStringSet. >>> >>> Eg : >>> A BStringSet instance of length 5119916 >>> width seq >>> [1] 75 >> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> [2] 75 >> bbbbbbbbbbbbabbbbbb`bbbbbbab`b_...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> [3] 75 >> aaaaaaa_aaaaO`aa^aaa_a_T_``^[`S...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> [4] 75 >> bbbbbbbbbbbbaabbbb`bbb_Uaa___BB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> [5] 75 >> ``a`aa`aaYaTaaaBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB >>> >>> What I would like to do is subseq the last 5 quality values and do a >> count on #B. We suspect despite good avg quality we still have HIGH bad >> bases at the end of reads. >>> >>> Any other ideas welcome. >> >> How about just plotting the average quality score at each base >> position by doing something like: >> >> 1. Converting your phred score BStringSet into a matrix of its numeric >> values >> 2. Plotting the colMeans(...) of that matrix. >> >> Maybe? >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
