Hello Martin and Harris,
Thank you for your great, very complete and extremely fast replies!! ^_^
Greetings,
Leonardo
PD Just CCing so the replies get archived on the mailing list for later
newbies like me ;)
Martin Morgan wrote:
Harris A. Jaffee wrote:
Perhaps this is a manifestation of those silent choices:
as(quality(reads[1])[[1]], "numeric")
[1] 97 97 97 96 92 97 97 97 95 96 95 96 97 97 97 97 97 97 94 88 97 96
91 68 90
[26] 96 96 97 96 96 97 96 96 96 97 96 94 96 97 97 97 83 96 92 93 96 96
92 92 97
It's not really obvious to me why quality(<...>)[[1]] extracts a
BStringSet. But once it does, the BStringSet is convert to numeric as
appropriate for a BString, which turns out to be the ASCII encoding.
as(quality(reads[1]), "numeric")
[1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63
58 35 57
[26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
59 59 64
quality(<...>) is a SFastqQuality or FastqQuality object, and is
converted to numeric as appropriate for short read qualities.
I am less than a novice in ShortRead, so I don't feel adequate enough to
post to
the world. Barely adequate enough to make a private guess.
You're too modest, Harris.
Martin
On Nov 26, 2009, at 6:13 PM, Martin Morgan wrote:
Hi Leonardo --
Leonardo Collado Torres wrote:
Dear list,
I'm puzzled over something very simple. I've got some fresh Illumina
data and I want to make a plot of the median (10% quantile, etc)
qualities (Y axis) per cycle (X axis).
On the code below, I'm simply reading a test file in fastq format,
printing the quality, checking that nothing weird went when reading the
file (could be :P), and changing the ASCII qualities to numeric values.
The symbol "D" gets changed to 35; the lowest value on the 1st sequence.
library(ShortRead)
reads <- readFastq(dirPath = ".", pattern = "test.txt")
reads[1]
class: ShortReadQ
length: 1 reads; width: 50 cycles
sread(reads[1])
A DNAStringSet instance of length 1
width seq
[1] 50 ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
quality(reads[1])
class: FastqQuality
quality:
A BStringSet instance of length 1
width seq
[1] 50 aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
system("head -n 4 test.txt")
@HWUSI-EAS636_0001:8:1:1:1353#0/1
ACCGTGCCGAGCACACGGGTGACNCCGCGATAGCCGGGCGGTGTATAGGG
+HWUSI-EAS636_0001:8:1:1:1353#0/1
aaa`\aaa_`_`aaaaaa^Xa`[DZ``a``a```a`^`aaaS`\]``\\a
as(quality(reads[1]), "numeric") # Will use "matrix" when using all
reads instead of the 1st one
[1] 64 64 64 63 59 64 64 64 62 63 62 63 64 64 64 64 64 64 61 55 64 63 58
35 57
[26] 63 63 64 63 63 64 63 63 63 64 63 61 63 64 64 64 50 63 59 60 63 63
59 59 64
summary(as(quality(reads[1]), "numeric"))
Min. 1st Qu. Median Mean 3rd Qu. Max.
35.0 62.0 63.0 61.7 64.0 64.0
Now, what confuses me is that according to
http://en.wikipedia.org/wiki/FASTQ_format SolexaPipeline 1.3+ "can
encode a Phred quality score from 0 to 62 using ASCII 64 to 126
(although in raw read data Phred scores from 0 to 40 only are
expected)". Then, on the ASCII table at wiki
(http://en.wikipedia.org/wiki/ASCII) glyph (I called it symbol earlier)
"D" corresponds to 68 decimal, which should be a Phred quality score of
6 but its 35 in R. Glyph "a" is 97 decimal, which should be 35 but is 64
in R.
fastq files don't contain enough information to know whether the data
are encoded with ASCII 64 == 1 (Solexa-style) or ASCII 32 == 1
('classic'-style). There is an argument qualityType that you can use to
specify which, so
readFastq("test.txt", qualityType="SFastqQuality")
for Solexa encoding. Note that to read a single file it is no longer
necessary to specify both directory and pattern. And finally that
as(<...>, "matrix") etc convert from the ASCII to integer encoding;
there is still a transform from integer to p value, if that is desired,
and that there are several possibilities for that transformation as well.
Martin
So, do I substract 29 to all the values I get in R or am I missing
something?
Thank you and enjoy Thanksgiving if you celebrate it,
Leonardo
sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-unknown-linux-gnu
locale:
[1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3]
LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5]
LC_MONETARY=C LC_MESSAGES=en_US.iso885915 [7]
LC_PAPER=en_US.iso885915 LC_NAME=C [9]
LC_ADDRESS=C LC_TELEPHONE=C [11]
LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.4.0 lattice_0.17-26 BSgenome_1.14.0
Biostrings_2.14.0
[5] IRanges_1.4.0
loaded via a namespace (and not attached):
[1] Biobase_2.6.0 grid_2.10.0 hwriter_1.1 tools_2.10.0
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Leonardo Collado Torres, Bachelor in Genomic Sciences
Teacher at LCG and member of Dr. Enrique Morett's lab
UNAM Campus Cuernavaca, Mexico
Homepage: http://www.lcg.unam.mx/~lcollado/
Phone: [52] (777) 313-28-05
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing