Re: [Bioc-sig-seq] Quality Value Analysis from a BStringSet

Marcus Davy Fri, 11 Jun 2010 03:48:56 -0700

Hi,
there are a couple of ways to have a go at this, for example pattern match
for B's and count the matches using low level matching or coerce to numeric
quality scores and summarize those. I have provided an example where I
 subset the ShortReadQ object to work with the 3' end of interest. However,
if you do this, qa()  does not take into account that you may have changed
the starting cycle position to something other than 1. It would be quite
nice to have an additional argument  in qa() which sets an alternative start
cycle, for example if you ran a quality report on the last 20 cycles of
interest when there are 100 cycles, your qa() summary output should start at
cycle 81 to 100.


Another issue, is that ShortRead:::.plotCycleQuality only plots the means
for each quality score versus cycle position. This does not provide any
information about the variability in those quality scores for each cycle
which usually increases as the cycle number increases.

I have investigated using a range of quantile scores to provide additional
distributional information, however this did not visualize particularly
well.  Boxplots for each cycle position seem to be a better way to summarize
quality data. Examples of these are created using the fastx toolkit (
http://hannonlab.cshl.edu/fastx_toolkit/).

It would be nice to have function that can produce plotCycleQuality
Boxplots however this does not appear to be straightforward from summary
information derived from qa(). I would be interested if anyone has had a go
at doing this, I was thinking of an approach taking relevant quality
information out of qa() and making use of Rle objects to obtain summary
quantile statistics and then make boxplots from first principles using a
lattice panel function based on lattice::panel.bwplot.


library(ShortRead)
exptPath   <- system.file("extdata", package = "ShortRead")
sp         <- SolexaPath(exptPath)
fqpattern  <- "s_1_sequence.txt"
fl         <- file.path(analysisPath(sp), fqpattern)
fq         <- readFastq(sp, fqpattern)


w <- width(fq[1])
l <- 5
## Subset of fq for the last 5 cycles
fqs <- new("ShortReadQ", sread = DNAStringSet(sread(fq), (w-l+1), NA),
quality = SFastqQuality(BStringSet(quality(quality(fq)),
        (w-l+1), NA)), id = id(fq))

## Sanity check
quality(quality(fqs))

## Low level matching for pattern B
pattern <- "B"
for(i in 1:l) {
  cat("[ searching for", pattern, "at position", i, "using hasLetterAt ]\n")
  print(sum(hasLetterAt(quality(quality(fqs)), "X", i)))
  cat("[ searching for", pattern, "at position", i, "using isMatchingAt
]\n")
  print(sum(isMatchingAt("X", quality(quality(fqs)), i)))
}

## Extracting scores matrix of quality scores
scores <- as(SFastqQuality(quality(quality(fq))), "matrix")[, (w-l+1):w]
summary(scores)

## Run a qa report on a subset of fq
qaSummary  <- qa(fqs, lane="A")
reportName <- report(qaSummary)
browseURL(reportName)

## perCycle quality plot -note start cycle incorrect
perCycle <- qaSummary[["perCycle"]]
ShortRead:::.plotCycleQuality(perCycle$quality)



Marcus


On Fri, Jun 4, 2010 at 8:04 AM, Steve Lianoglou <
[email protected]> wrote:

> Hi,
>
> On Thu, Jun 3, 2010 at 3:39 PM, Pratap, Abhishek
> <[email protected]> wrote:
> > Hi All
> >
> > I would like to extract and count the last 5 quality values from the
> FASTQ file. I have read the file using "readFastq" and have stored the
> quality values as a BStringSet.
> >
> > Eg :
> > A BStringSet instance of length 5119916
> >          width seq
> >      [1]    75
> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >      [2]    75
> bbbbbbbbbbbbabbbbbb`bbbbbbab`b_...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >      [3]    75
> aaaaaaa_aaaaO`aa^aaa_a_T_``^[`S...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >      [4]    75
> bbbbbbbbbbbbaabbbb`bbb_Uaa___BB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >      [5]    75
> ``a`aa`aaYaTaaaBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >
> > What I would like to do is subseq the last 5 quality values and do a
> count on #B. We suspect despite good avg quality we still have HIGH bad
> bases at the end of reads.
> >
> > Any other ideas welcome.
>
> How about just plotting the average quality score at each base
> position by doing something like:
>
> 1. Converting your phred score BStringSet into a matrix of its numeric
> values
> 2. Plotting the colMeans(...) of that matrix.
>
> Maybe?
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Quality Value Analysis from a BStringSet

Reply via email to