Re: [Bioc-sig-seq] Quality Value Analysis from a BStringSet

Marcus Davy Sun, 13 Jun 2010 20:50:46 -0700

Hi Martin,
thanks for your comments.

With the quality score conversion I was getting different quality scores
using 'as(quality(fq), "matrix")' which would need to be offset by 31
because of Phred/Solexa score differences.


On Sat, Jun 12, 2010 at 1:55 AM, Martin Morgan <[email protected]> wrote:

> On 06/11/2010 12:48 PM, Marcus Davy wrote:
> > Hi,
> > there are a couple of ways to have a go at this, for example pattern
> match
> > for B's and count the matches using low level matching or coerce to
> numeric
> > quality scores and summarize those. I have provided an example where I
> >  subset the ShortReadQ object to work with the 3' end of interest.
> However,
> > if you do this, qa()  does not take into account that you may have
> changed
> > the starting cycle position to something other than 1. It would be quite
> > nice to have an additional argument  in qa() which sets an alternative
> start
> > cycle, for example if you ran a quality report on the last 20 cycles of
> > interest when there are 100 cycles, your qa() summary output should start
> at
> > cycle 81 to 100.
>
> One thing is that the subset might often and reasonably be ragged, e.g.,
> 'the last 20 nucleotides of a 454 run', where the reads are different
> widths.


Yes I agree with you, the context I was talking about was only for fixed
width Solexa data.
Incidentally sometime ago I did some tests randomly generating variable
length fastq data for 454. The qa report failed slightly on one component,
what was happening in my case (partly due to random generation of data) was
that I was getting all unique sequences in qa[["sequenceDistribution"]]
which was causing plotReadOccurrences to fail, probably the diff() in the
denominator is ultimately causing an infinite value. I think this is
unlikely to be an issue in practice with sequence artifacts like duplicate
sequences etc, maybe the likelihood could increase as the length of reads
increases for 454 ragged reads.


> >
> > Another issue, is that ShortRead:::.plotCycleQuality only plots the means
> > for each quality score versus cycle position. This does not provide any
> > information about the variability in those quality scores for each cycle
> > which usually increases as the cycle number increases.
>
> It's hard to know how to do this when the number of cycles gets large,
> without just writing a lot of ink to the plot. Maybe every 5th / 10th
> cycle might have quantile information? (with the qa object containing
> the info for all cycles).
>
> In  fastx boxplot graphs I have sighted there is consistent quantile
information after a certain cycle position depending on the experiment.
Using boxplots there also appears to be issues for high quality scores with
low variability which may not have any interquartile range width.

## Variable length example
library(ShortRead)
exptPath   <- system.file("extdata", package = "ShortRead")
sp         <- SolexaPath(exptPath)
fqpattern  <- "s_1_sequence.txt"
fl         <- file.path(analysisPath(sp), fqpattern)
fq         <- readFastq(sp, fqpattern)

## Example removing duplicates to illustrate error
set.seed(1)
fq.ragged <- narrow(fq[!srduplicated(fq)], 1, sample(10:30, width(fq),
replace=TRUE))
qaSummary  <- qa(fq.ragged, lane="A")

## Report fails
reportName <- report(qaSummary)
browseURL(reportName)

df <- qaSummary[["sequenceDistribution"]]
## Input df only having one row as all sequences are unique
print(df)
## plotreadOcurances failing
ShortRead:::.plotReadOccurrences(df[df$type=="read",], cex=.5)


## Quality offset Differs due to confusion around Phred/Solexa scores and
ACSII offet
identical(as(SFastqQuality(quality(quality(fq))), "matrix"), as(quality(fq),
"matrix"))

head(as(SFastqQuality(quality(quality(fq))), "matrix"),1)
head(as(quality(fq), "matrix"),1) - 31 ## 64-32-1 offset

## Skewed scores in boxplot example with no interquartile range - fastx
doesn't plot outliers
boxplot(rep(c(40, 39,38, 37), c(10000,15,5,2)))

cheers,


Marcus


> I have investigated using a range of quantile scores to provide additional
> > distributional information, however this did not visualize particularly
> > well.  Boxplots for each cycle position seem to be a better way to
> summarize
> > quality data. Examples of these are created using the fastx toolkit (
> > http://hannonlab.cshl.edu/fastx_toolkit/).
> >
> > It would be nice to have function that can produce plotCycleQuality
> > Boxplots however this does not appear to be straightforward from summary
> > information derived from qa(). I would be interested if anyone has had a
> go
> > at doing this, I was thinking of an approach taking relevant quality
> > information out of qa() and making use of Rle objects to obtain summary
> > quantile statistics and then make boxplots from first principles using a
> > lattice panel function based on lattice::panel.bwplot.
> >
> >
> > library(ShortRead)
> > exptPath   <- system.file("extdata", package = "ShortRead")
> > sp         <- SolexaPath(exptPath)
> > fqpattern  <- "s_1_sequence.txt"
> > fl         <- file.path(analysisPath(sp), fqpattern)
> > fq         <- readFastq(sp, fqpattern)
> >
> >
> > w <- width(fq[1])
> > l <- 5
> > ## Subset of fq for the last 5 cycles
> > fqs <- new("ShortReadQ", sread = DNAStringSet(sread(fq), (w-l+1), NA),
> > quality = SFastqQuality(BStringSet(quality(quality(fq)),
> >         (w-l+1), NA)), id = id(fq))
>
> Better to call the constructor ShortReadQ(sread=...) than 'new'
> directly. Better still to
>
>  narrow(fq, w-l+1, NA)
>
> (here to narrow an object with variable width reads one might
>
>  narrow(fq, width(fq)-l+1, NA)
>
> >
> > ## Sanity check
> > quality(quality(fqs))
> >
> > ## Low level matching for pattern B
> > pattern <- "B"
> > for(i in 1:l) {
> >   cat("[ searching for", pattern, "at position", i, "using hasLetterAt
> ]\n")
> >   print(sum(hasLetterAt(quality(quality(fqs)), "X", i)))
> >   cat("[ searching for", pattern, "at position", i, "using isMatchingAt
> > ]\n")
> >   print(sum(isMatchingAt("X", quality(quality(fqs)), i)))
> > }
> >
> > ## Extracting scores matrix of quality scores
> > scores <- as(SFastqQuality(quality(quality(fq))), "matrix")[, (w-l+1):w]
>
> already an SFastqQuality, so just as(quality(fq), "matrix").
>
> Thanks for the ideas. Martin
>
> > summary(scores)
> >
> > ## Run a qa report on a subset of fq
> > qaSummary  <- qa(fqs, lane="A")
> > reportName <- report(qaSummary)
> > browseURL(reportName)
> >
> > ## perCycle quality plot -note start cycle incorrect
> > perCycle <- qaSummary[["perCycle"]]
> > ShortRead:::.plotCycleQuality(perCycle$quality)
> >
> >
> >
> > Marcus
> >
> >
> > On Fri, Jun 4, 2010 at 8:04 AM, Steve Lianoglou <
> > [email protected]> wrote:
> >
> >> Hi,
> >>
> >> On Thu, Jun 3, 2010 at 3:39 PM, Pratap, Abhishek
> >> <[email protected]> wrote:
> >>> Hi All
> >>>
> >>> I would like to extract and count the last 5 quality values from the
> >> FASTQ file. I have read the file using "readFastq" and have stored the
> >> quality values as a BStringSet.
> >>>
> >>> Eg :
> >>> A BStringSet instance of length 5119916
> >>>          width seq
> >>>      [1]    75
> >> BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >>>      [2]    75
> >> bbbbbbbbbbbbabbbbbb`bbbbbbab`b_...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >>>      [3]    75
> >> aaaaaaa_aaaaO`aa^aaa_a_T_``^[`S...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >>>      [4]    75
> >> bbbbbbbbbbbbaabbbb`bbb_Uaa___BB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >>>      [5]    75
> >> ``a`aa`aaYaTaaaBBBBBBBBBBBBBBBB...BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> >>>
> >>> What I would like to do is subseq the last 5 quality values and do a
> >> count on #B. We suspect despite good avg quality we still have HIGH bad
> >> bases at the end of reads.
> >>>
> >>> Any other ideas welcome.
> >>
> >> How about just plotting the average quality score at each base
> >> position by doing something like:
> >>
> >> 1. Converting your phred score BStringSet into a matrix of its numeric
> >> values
> >> 2. Plotting the colMeans(...) of that matrix.
> >>
> >> Maybe?
> >>
> >> --
> >> Steve Lianoglou
> >> Graduate Student: Computational Systems Biology
> >>  | Memorial Sloan-Kettering Cancer Center
> >>  | Weill Medical College of Cornell University
> >> Contact Info: http://cbio.mskcc.org/~lianos/contact
> >>
> >> _______________________________________________
> >> Bioc-sig-sequencing mailing list
> >> [email protected]
> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >>
> >
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
'

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Quality Value Analysis from a BStringSet

Reply via email to