Short readers!

I want to introduce the ShortRead package for high-throughput sequencing. It is available using biocLite with a development version of R, or via svn. It is still under active development, but useful anyway. Here are some of the main functions:
readXStringColumns, readFastq, readAligned: these functions read sequence data into R objects. The XStringColumns variant is the building block, reading one or more columns of sequence, quality score, or other data into the corresponding XStringSet (from Biostrings) object. readFastq reads fastq-style (sequence + quality) files. readAligned reads alignment files (currently Solexa 'export' and maq 'mapview'; maq binary is coming soon).

SolexaPath, SolexaSet: these are functions and classes to help with Solexa data. SolexaPath provides a convenient way to navigate the file hierarchy created by a Solexa run. SolexaSet is like an ExpressionSet, coordinating sequence data with the phenotype description of the samples. The SolexaSet class is still a work in progress.

alphabetByCycle, srorder, srduplicated, srsort and additional functions provide basic tools for exploring XStringSet objects. For instance, alphabetByCycle can be used to summarize nucleotide frequency or quality score by cycle; several of the data sets I've looked at show surprising patterns that trace back to quality control issues of one sort or another.

All of the objects are intended to be created with constructors (e.g., the read* functions, or SolexaPath() and the like) rather than explicit calls to 'new'. There are accessors (e.g., sread() and quality() to extract the reads and quality scores) and other basic manipulations (e.g., subset operations) that coordinate the different components of the object.

The package is still very much in development. Complete man pages usually indicate a relatively stable structure or functionality; all of the functions and classes mentioned above have man pages.
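To make the by-cycle summary mentioned for alphabetByCycle a little more concrete, here is a minimal sketch of that kind of tabulation. It is written in plain Python rather than R so that it runs without the package or any data files. The function name alphabet_by_cycle, the toy reads, and the letter-to-list layout are all illustrative assumptions; this is not ShortRead's implementation, and it does not reproduce the return type of alphabetByCycle.

```python
def alphabet_by_cycle(reads, alphabet="ACGTN"):
    """Count how often each letter occurs at each cycle (read position).

    Returns a dict mapping letter -> list of counts, one entry per cycle.
    Reads shorter than the longest read contribute nothing to later cycles;
    letters outside `alphabet` are ignored.
    """
    n_cycles = max((len(read) for read in reads), default=0)
    counts = {letter: [0] * n_cycles for letter in alphabet}
    for read in reads:
        for cycle, letter in enumerate(read):
            if letter in counts:
                counts[letter][cycle] += 1
    return counts


# Hypothetical toy reads, four cycles each (not real sequencing data).
reads = ["ACGT", "ACGA", "TCGN", "ACTT"]
table = alphabet_by_cycle(reads)

# One row per letter, one column per cycle.
for letter in "ACGTN":
    print(letter, table[letter])
```

In a table like this, a letter whose count jumps or collapses at one particular cycle is the sort of pattern worth tracing back to a quality control issue. The same tabulation applies to quality scores if the quality characters are used as the alphabet.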
Future directions include a 'qa' suite for Solexa data, a MAQ binary file parser (thanks to Simon Anders), further tools for efficiently manipulating and representing these objects (generally, 32-bit users will be frustrated by the current generation, especially in downstream analysis), and useful functionality for qa and exploratory assessment. A vignette is also in the works, to provide common workflows and more detail on package use.

I'm eager to hear your feedback, and would be happy to incorporate additions that you might have been working on for your own purposes -- the current Solexa emphasis reflects the data we have most ready access to, but ShortRead is meant as an entry point for many of the high-throughput technologies.

Martin

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
