Re: [Bioc-sig-seq] Confused on fastq to numeric conversion (SolexaPipeline 1.3+)

Leonardo Collado Torres Thu, 26 Nov 2009 19:28:08 -0800

Hello Martin and Harris,

Thank you for your great, very complete and extremely fast replies!! ^_^


Greetings,
Leonardo

PD Just CCing so the replies get archived on the mailing list for laternewbies like me ;)


Martin Morgan wrote:

Harris A. Jaffee wrote:

Perhaps this is a manifestation of those silent choices:

as(quality(reads[1])[[1]], "numeric")

 [1] 97 97 97 96 92 97 97 97 95 96 95 96 97 97 97 97 97 97 94 88 97 96
91 68 90
[26] 96 96 97 96 96 97 96 96 96 97 96 94 96 97 97 97 83 96 92 93 96 96
92 92 97


It's not really obvious to me why quality(<...>)[[1]] extracts a
BStringSet. But once it does, the BStringSet is convert to numeric as
appropriate for a BString, which turns out to be the ASCII encoding.

as(quality(reads[1]), "numeric")

 [1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63
58 35 57
[26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
59 59 64


quality(<...>) is a SFastqQuality or FastqQuality object, and is
converted to numeric as appropriate for short read qualities.

I am less than a novice in ShortRead, so I don't feel adequate enough to
post to
the world.  Barely adequate enough to make a private guess.


You're too modest, Harris.

Martin

On Nov 26, 2009, at 6:13 PM, Martin Morgan wrote:

Hi Leonardo --

Leonardo Collado Torres wrote:

Dear list,

I'm puzzled over something very simple. I've got some fresh Illumina
data and I want to make a plot of the median (10% quantile, etc)
qualities (Y axis) per cycle (X axis).

On the code below, I'm simply reading a test file in fastq format,
printing the quality, checking that nothing weird went when reading the
file (could be :P), and changing the ASCII qualities to numeric values.
The symbol "D" gets changed to 35; the lowest value on the 1st sequence.

library(ShortRead)
reads <- readFastq(dirPath = ".", pattern = "test.txt")

reads[1]

class: ShortReadQ
length: 1 reads; width: 50 cycles

sread(reads[1])

 A DNAStringSet instance of length 1
   width seq
[1]    50 ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG

quality(reads[1])

class: FastqQuality
quality:
 A BStringSet instance of length 1
   width seq
[1]    50 aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a

system("head -n 4 test.txt")

@HWUSI-EAS636_0001:8:1:1:1353#0/1
ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
+HWUSI-EAS636_0001:8:1:1:1353#0/1
aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a

as(quality(reads[1]), "numeric") # Will use "matrix" when using all

reads instead of the 1st one
[1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63 58
35 57
[26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
59 59 64

summary(as(quality(reads[1]), "numeric"))

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  35.0    62.0    63.0    61.7    64.0    64.0

Now, what confuses me is that according to

http://en.wikipedia.org/wiki/FASTQ_format SolexaPipeline 1.3+ "can
encode a Phred quality score from 0 to 62 using ASCII 64 to 126
(although in raw read data Phred scores from 0 to 40 only are
expected)". Then, on the ASCII table at wiki
(http://en.wikipedia.org/wiki/ASCII) glyph (I called it symbol earlier)
"D" corresponds to 68 decimal, which should be a Phred quality score of
6 but its 35 in R. Glyph "a" is 97 decimal, which should be 35 but is 64
in R.

fastq files don't contain enough information to know whether the data
are encoded with ASCII 64 == 1 (Solexa-style) or ASCII 32 == 1
('classic'-style). There is an argument qualityType that you can use to
specify which, so

  readFastq("test.txt", qualityType="SFastqQuality")

for Solexa encoding. Note that to read a single file it is no longer
necessary to specify both directory and pattern. And finally that
as(<...>, "matrix") etc convert from the ASCII to integer encoding;
there is still a transform from integer to p value, if that is desired,
and that there are several possibilities for that transformation as well.

Martin

So, do I substract 29 to all the values I get in R or am I missing
something?

Thank you and enjoy Thanksgiving if you celebrate it,
Leonardo

sessionInfo()

R version 2.10.0 (2009-10-26)
x86_64-unknown-linux-gnu

locale:
[1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C                 [3]
LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915   [5]
LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915  [7]
LC_PAPER=en_US.iso885915       LC_NAME=C                    [9]
LC_ADDRESS=C                   LC_TELEPHONE=C               [11]
LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:

[1] ShortRead_1.4.0 lattice_0.17-26 BSgenome_1.14.0Biostrings_2.14.0

[5] IRanges_1.4.0
loaded via a namespace (and not attached):
[1] Biobase_2.6.0 grid_2.10.0   hwriter_1.1   tools_2.10.0

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Leonardo Collado Torres, Bachelor in Genomic Sciences
Teacher at LCG and member of Dr. Enrique Morett's lab
UNAM Campus Cuernavaca, Mexico

Homepage: http://www.lcg.unam.mx/~lcollado/
Phone: [52] (777) 313-28-05

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Confused on fastq to numeric conversion (SolexaPipeline 1.3+)

Reply via email to