Re: [Bioc-sig-seq] Problem with ShortRead reading quality scores from bowtie

Fuad Gwadry Thu, 06 Aug 2009 05:57:58 -0700

Hi Martin,
 
Thank you for the clarifications. 
 
Another question regarding quality scores. The Solexa/Illumina 1.0 format 
encodes a quality score from -5 to 40 using ASCII 59 to 104 
(http://en.wikipedia.org/wiki/FASTQ_format#Quality). However, if you look below 
the the ASCII characters do not reflect that. I am actually using 
Solexa/Illumina 0.3 but my understanding is that it is the same format as 1.0.


 

Am I thinking about this correctly ?

 

Fuad


> alf
 [1] " "  "!"  "\"" "#"  "$"  "%"  "&"  "'"  "("  ")"  "*"  "+"  ","  "-"  "." 
[16] "/"  "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  ":"  ";"  "<"  "=" 
[31] ">"  "?"  "@"  "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L" 
[46] "M"  "N"  "O"  "P"  "Q"  "R"  "S"  "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "[" 
[61] "\\" "]"  "^"  "_"  "`"  "a"  "b"  "c"  "d"  "e"  "f"  "g"  "h"  "i"  "j" 
[76] "k"  "l"  "m"  "n"  "o"  "p"  "q"  "r"  "s"  "t"  "u"  "v"  "w"  "x"  "y" 
[91] "z"  "{"  "|"  "}" 
> quality(aln)
class: SFastqQuality
quality:
  A BStringSet instance of length 4933275
          width seq
      [1]    32 <<<<<<<<<<<<<<<<<;<<<;<<<<<5.<,:
      [2]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      [3]    32 <<<<<<<<<<<<;<<;<<<<<<<+<<<<<<<<
      [4]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      [5]    32 <<<<<<<<<<<<<<<<<<<;<<<<<<<<<<6<
      [6]    32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      [7]    32 <<<<<<<<<<<<<<<<+<-<<<<;<<<<<<<<
      [8]    32 <<<<<<<<<<<<<<<<<<<<<<<<<,<+<<<<
      [9]    32 <<<<<<<<<<<<<<<<<<<5<<<<7<<<<<<2
      ...   ... ...
[4933267]    32 <<<<<<<<<<<<<<<<<<<<<<3<<8<<<<<<
[4933268]    32 <<<<<<<<<<<<<<<<<<<<<;<<<98<<<<<
[4933269]    32 <<<<<<<<<<<<<<<<</<<<5<<1;<;<7<<
[4933270]    32 <<<<<<<<<<<<<<<<<<<<<<<<;<<<<<6:
[4933271]    32 <<<<<<<<73;<;<<<<<;';57<;<0;<<55
[4933272]    32 <<<<<<3<<<<<<<<6<<<<<<-7<<<7*<<7
[4933273]    32 <<<<<<<<<<<<<3<<<<<<+<<6<<<4&5<6
[4933274]    32 <<<<<<<<<<<<<<<<<<<<<<<<4<<<<<<<
[4933275]    32 <<<<<<<<<<<<<<<<<7<<;<8<<6<<<<<)

 
> Date: Wed, 5 Aug 2009 09:21:56 -0700
> From: [email protected]
> To: [email protected]
> CC: [email protected]; [email protected]; 
> [email protected]
> Subject: Re: [Bioc-sig-seq] Problem with ShortRead reading quality scores 
> from bowtie
> 
> Hi Faud --
> 
> Fuad Gwadry wrote:
> > Hi Martin.
> > 
> > My data is generated from the old pipeline (illumina 0.3) and 
> > therefore should not be a problem. In any case I tested your suggestion 
> > but it had no effect on the data. The colMeans(m) are the same before 
> > and after incorporating your suggestions. Also when I use alignQuality I 
> > get NAs in both cases. Let me know how I can help to trouble shoot this. 
> 
> NA's in alignQuality are present when the reads did not align; not all 
> of the values will be NA. see below for the issue with base quality scores.
> 
> > -- Fuad
> > 
> > The first part is using my original data:
> > 
> > > aln
> > class: AlignedRead
> > length: 4933275 reads; width: 32 cycles
> > chromosome: phiX174 phiX174 ... phiX174 phiX174
> > position: 5147 1513 ... 5034 3107
> > strand: - - ... + +
> > alignQuality: NumericQuality
> > alignData varLabels: similar mismatch
> > > head(quality(aln))
> > class: SFastqQuality
> > quality:
> > A BStringSet instance of length 6
> > width seq
> > [1] 32 <<<<<<<<<<<<<<<<<;<<<;<<<<<5.<,:
> > [2] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > [3] 32 <<<<<<<<<<<<;<<;<<<<<<<+<<<<<<<<
> > [4] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > [5] 32 <<<<<<<<<<<<<<<<<<<;<<<<<<<<<<6<
> > [6] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > m <- as(quality(aln), "matrix")
> > > colMeans(m)
> > [1] -3.865733 -3.867642 -3.878435 -3.869599 -3.884619 -3.891333 -3.904493
> > [8] -3.933454 -3.965951 -3.995224 -4.301351 -4.331125 -4.427614 -4.495761
> > [15] -4.556334 -4.586264 -4.623881 -4.653223 -4.744307 -4.825347 -4.913756
> > [22] -5.000141 -5.138716 -5.213431 -5.375823 -5.635962 -5.779450 -5.791400
> > [29] -5.944196 -6.175808 -6.368913 -6.597291
> > > alignQuality(aln)
> > class: NumericQuality
> > quality: NA NA ... NA NA (4933275 total)
> > 
> > ##################
> > incorporating your suggestions
> > ##################
> > 
> > > qual <- FastqQuality(quality(quality(aln)))
> > > qual
> > class: FastqQuality
> > quality:
> > A BStringSet instance of length 4933275
> > width seq
> > [1] 32 <<<<<<<<<<<<<<<<<;<<<;<<<<<5.<,:
> > [2] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > [3] 32 <<<<<<<<<<<<;<<;<<<<<<<+<<<<<<<<
> > [4] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > [5] 32 <<<<<<<<<<<<<<<<<<<;<<<<<<<<<<6<
> > [6] 32 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > [7] 32 <<<<<<<<<<<<<<<<+<-<<<<;<<<<<<<<
> > [8] 32 <<<<<<<<<<<<<<<<<<<<<<<<<,<+<<<<
> > [9] 32 <<<<<<<<<<<<<<<<<<<5<<<<7<<<<<<2
> > ... ... ...
> > [4933267] 32 <<<<<<<<<<<<<<<<<<<<<<3<<8<<<<<<
> > [4933268] 32 <<<<<<<<<<<<<<<<<<<<<;<<<98<<<<<
> > [4933269] 32 <<<<<<<<<<<<<<<<</<<<5<<1;<;<7<<
> > [4933270] 32 <<<<<<<<<<<<<<<<<<<<<<<<;<<<<<6:
> > [4933271] 32 <<<<<<<<73;<;<<<<<;';57<;<0;<<55
> > [4933272] 32 <<<<<<3<<<<<<<<6<<<<<<-7<<<7*<<7
> > [4933273] 32 <<<<<<<<<<<<<3<<<<<<+<<6<<<4&5<6
> > [4933274] 32 <<<<<<<<<<<<<<<<<<<<<<<<4<<<<<<<
> > [4933275] 32 <<<<<<<<<<<<<<<<<7<<;<8<<6<<<<<)
> > > initialize(aln, quality=qual)
> > class: AlignedRead
> > length: 4933275 reads; width: 32 cycles
> > chromosome: phiX174 phiX174 ... phiX174 phiX174
> > position: 5147 1513 ... 5034 3107
> > strand: - - ... + +
> > alignQuality: NumericQuality
> > alignData varLabels: similar mismatch
> 
> The statement above should be
> 
> aln <- initialize(aln, quality=qual)
> 
> i.e., create a new instance and save it to the variable aln. Save it to 
> a different variable if you want to avoid over-writing aln.
> 
> Martin
> 
> > > m <- as(quality(aln), "matrix")
> > > colMeans(m)
> > [1] -3.865733 -3.867642 -3.878435 -3.869599 -3.884619 -3.891333 -3.904493
> > [8] -3.933454 -3.965951 -3.995224 -4.301351 -4.331125 -4.427614 -4.495761
> > [15] -4.556334 -4.586264 -4.623881 -4.653223 -4.744307 -4.825347 -4.913756
> > [22] -5.000141 -5.138716 -5.213431 -5.375823 -5.635962 -5.779450 -5.791400
> > [29] -5.944196 -6.175808 -6.368913 -6.597291
> > > alignQuality(aln)
> > class: NumericQuality
> > quality: NA NA ... NA NA (4933275 total)
> > 
> > > sessionInfo()
> > R version 2.9.1 (2009-06-26)
> > x86_64-unknown-linux-gnu
> > locale:
> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base 
> > other attached packages:
> > [1] ShortRead_1.3.22 lattice_0.17-25 BSgenome_1.13.10 
> > Biostrings_2.13.29
> > [5] IRanges_1.3.44 
> > loaded via a namespace (and not attached):
> > [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1 
> > 
> > 
> > 
> > > Date: Tue, 4 Aug 2009 09:07:51 -0700
> > > From: [email protected]
> > > To: [email protected]
> > > CC: [email protected]; [email protected]; 
> > [email protected]
> > > Subject: Re: [Bioc-sig-seq] Problem with ShortRead reading quality 
> > scores from bowtie
> > >
> > > Martin Morgan wrote:
> > > > Fuad Gwadry wrote:
> > > >> Hi All
> > > >>
> > > >>
> > > >>
> > > >> I am getting negative values when reading quality scores when I read
> > > >> data generated in bowtie. Has anyone run into the same issue when
> > > >> using data generated by bowtie ? My session info is below.
> > > >
> > > > Hi Fuad -- ShortRead is reading the quality scores on the wrong scale
> > > > (solexa, rather than phred; this will be fixed before the next 
> > release).
> > > > Try
> > > >
> > > > qual <- FastqQuality(quality(quality(aln))
> > > > initialize(aln, quality=qual)
> > > >
> > > > to update aln, or
> > > >
> > > > m <- as(FastqQuality(quality(quality(aln)), "matrix")
> > > >
> > > > for a one-off solution.
> > >
> > > I wanted to clarify, too, both for this post and one yesterday, that
> > > as() is simply converting the character encoding to the corresponding
> > > integer value that each letter encodes; there is a secondary mapping
> > > from this encoding to log-odds or phred score that is not being
> > > performed. This step is, I think
> > >
> > > 10^(-m/10) for phred scores
> > > 1 - 1 / (1 + 10^(-m/10)) for Solexa scores
> > >
> > > Solexa has changed its encoding scheme very recently; I think it is now
> > > standard phred but am not sure.
> > >
> > > Martin
> > >
> > > >
> > > > Martin
> > > >
> > > >>
> > > >>
> > > >>
> > > >> Thanks in advance
> > > >>
> > > >>
> > > >>
> > > >> Fuad
> > > >>
> > > >>
> > > >>
> > > >>> aln
> > > >> class: AlignedRead
> > > >> length: 4591807 reads; width: 32 cycles
> > > >> chromosome: chr13 chr7 ... chr6 chr4 position: 93437004 13223395 ...
> > > >> 23636747 23353864 strand: - - ... + + alignQuality: NumericQuality
> > > >> alignData varLabels: similar mismatch
> > > >>
> > > >>> m <- as(quality(aln), "matrix")
> > > >>> colMeans(m)
> > > >> [1] -7.186638 -7.205858 -7.203382 -7.197175 -7.203629 -7.217016
> > > >> [7] -7.240661 -7.249238 -7.268499 -7.286551 -7.306615 -7.324003
> > > >> [13] -7.523238 -7.581242 -7.697591 -7.695861 -7.735321 -7.743323
> > > >> [19] -7.752996 -7.849403 -7.862658 -7.931969 -7.979778 -8.029288
> > > >> [25] -8.120469 -8.215818 -8.335176 -8.411609 -8.587005 -8.820979
> > > >> [31] -11.447326 -11.644198
> > > >>
> > > >>
> > > >>> sessionInfo()
> > > >> R version 2.9.1 (2009-06-26) x86_64-unknown-linux-gnu
> > > >> locale:
> > > >> 
> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >  
> > 
> > > >>
> > > >>
> > > >> attached base packages:
> > > >> [1] stats graphics grDevices utils datasets methods base
> > > >> other attached packages:
> > > >> [1] ShortRead_1.3.22 lattice_0.17-25 BSgenome_1.13.10
> > > >> Biostrings_2.13.29
> > > >> [5] IRanges_1.3.44
> > > >> loaded via a namespace (and not attached):
> > > >> [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1
> > > >>
> > > >> _________________________________________________________________
> > > >> More storage. Better anti-spam and antivirus protection. Hotmail 
> > makes
> > > >> it simple.
> > > >>
> > > >> [[alternative HTML version deleted]]
> > > >>
> > > >> _______________________________________________
> > > >> Bioc-sig-sequencing mailing list
> > > >> [email protected]
> > > >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> > > >
> > > >
> > >
> > >
> > > --
> > > Martin Morgan
> > > Computational Biology / Fred Hutchinson Cancer Research Center
> > > 1100 Fairview Ave. N.
> > > PO Box 19024 Seattle, WA 98109
> > >
> > > Location: Arnold Building M1 B861
> > > Phone: (206) 667-2793
> > 
> > ------------------------------------------------------------------------
> > Attention all humans. We are your photos. Free us. 
> > <http://go.microsoft.com/?linkid=9666046>
> 
> 
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793


_________________________________________________________________
Send and receive email from all of your webmail accounts.

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Problem with ShortRead reading quality scores from bowtie

Reply via email to