On Sep 10, 2009, at 1:40 PM, Martin Morgan wrote:
Michael Muratet wrote:
Greetings
I would like to be able to use the ShortRead package on CIF file data
from the latest version of the Illumina SCS/Pipeline tools. It appears
that the current version of ShortRead (1.2.1?) doesn't handle this
format. I have a snippet of an R script that will read these binary files
and produce the same output as the Illumina cifToTxt tool, and I'm
willing to do the work to incorporate it into ShortRead. Are there
already plans to do this? Can anyone point me to a document that describes
the basic syntax and data structures behind R objects of this class?
I've looked at the ShortRead source and I'm not sure I could figure it
out just from that.
Hi Michael --
No, ShortRead does not parse CIF format. Is there a specification
somewhere? If you'd like to contribute the relevant parser, that would
be great! You'll probably want to use the development version of R and
of ShortRead (currently 1.3.33).
Martin
There's a spec on pg 119 of the v1.4 pipeline manual. The noise file
(*.cnf) follows the same format. You also have to read position files
to get the coordinates within a tile of the intensity values.
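
For illustration, here is a minimal sketch of a reader for this kind of binary file, written in Python rather than R. The byte layout is an assumption and is not taken from this thread: a 3-byte "CIF" magic string, a 1-byte version, a 1-byte value width, a 2-byte first cycle, a 2-byte cycle count and a 4-byte cluster count, followed by little-endian signed integers ordered by cycle, then channel (A, C, G, T), then cluster. It should be checked against the spec in the pipeline manual before being relied on. The name parse_cif and the shape of its return value are hypothetical.

```python
import struct

# Assumed header: magic "CIF", version, bytes per value,
# first cycle, number of cycles, number of clusters (little-endian).
HEADER = struct.Struct("<3sBBHHI")
CHANNELS = "ACGT"                 # assumed channel order
CODES = {1: "b", 2: "h", 4: "i"}  # signed integer codes by byte width


def parse_cif(buf):
    """Parse CIF-style bytes into per-cycle, per-channel intensity lists."""
    magic, version, width, first_cycle, n_cycles, n_clusters = \
        HEADER.unpack_from(buf, 0)
    if magic != b"CIF":
        raise ValueError("missing CIF magic bytes")
    if width not in CODES:
        raise ValueError("unsupported value width: %d" % width)

    # All intensity values follow the header as one flat run.
    count = n_cycles * len(CHANNELS) * n_clusters
    fmt = "<%d%s" % (count, CODES[width])
    values = struct.unpack_from(fmt, buf, HEADER.size)

    # Slice the flat run into cycle -> channel -> per-cluster values.
    intensities = []
    pos = 0
    for _ in range(n_cycles):
        cycle = {}
        for channel in CHANNELS:
            cycle[channel] = list(values[pos:pos + n_clusters])
            pos += n_clusters
        intensities.append(cycle)

    return {
        "version": version,
        "first_cycle": first_cycle,
        "n_clusters": n_clusters,
        "intensities": intensities,
    }
```

Reading a real file would look like parse_cif(open(path, "rb").read()). If the assumed layout holds, the same function should also apply to the noise (*.cnf) files, since they are said above to follow the same format. An R version of the same idea would use readBin on a binary connection.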
You're aiming for an object of class AlignedRead, which you would
construct from the bits you parse with a call to
AlignedRead(<your stuff here>)
The CIF files hold intensities before crosstalk/offset/phase corrections
and basecalling. Is there not a separate structure for intensity
values? I see in the R folder in the source there is a readIntensities
method that accepts 'SolexaIntensity' and 'IparIntensity'. I don't
know the data structures well enough yet to know where the data goes,
although I can see how one might add 'CifIntensity' to the code.
There is some 'essential' information, like the reads and their quality
scores, the chromosome and position of alignment; other stuff gets put
in an 'AlignedDataFrame'.
See ?AlignedRead and ?"AlignedRead-class" for more. I'm happy to provide
additional guidance, too.
I'll download the development versions and see what I can do.
Regards
Mike
Martin
Hopefully, the CIF format will be around for a while.
Thanks
Mike
Michael Muratet, Ph.D.
Senior Scientist
HudsonAlpha Institute for Biotechnology
[email protected]
(256) 327-0473 (p)
(256) 327-0966 (f)
Room 4005
601 Genome Way
Huntsville, Alabama 35806
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing