Dear Galaxy-dev guys!

Our research group has for the last few years been developing the Genomic 
HyperBrowser (http://hyperbrowser.uio.no), a system for statistical analysis 
of genomic data, built on top of Galaxy. The system currently includes 17 
hypothesis tests and 53 descriptive statistics. 

In the process of developing the HyperBrowser, we have run into several 
shortcomings of the usual tabular formats for genomic datasets: BED, WIG, GFF, 
BedGraph, etc. This has led us to define (yet another) format for genomic 
data: the GTrack format (short for Genomic Track). The format will hopefully be 
published in an article, currently under review (together with an extension of 
the XML format BioXSD, which supports many of the same properties as the GTrack 
format). While the article is under review, we would be very interested in 
feedback from you guys, so that the format can be as good as possible.

The basic issues handled are the following:

1. The first issue is the very existence of so many formats. We have, in the 
article, defined 15 different track types and believe that this variety of 
track types is the main reason for the proliferation of tabular formats. The 
track types include the usual Segments (as in BED) and Valued Segments (as in 
BedGraph), but also other types such as Points or Step Function. In addition, 
we introduce linked track types usable for analysis of three-dimensional data 
sets. The GTrack format handles all 15 track types.

2. Simple to create. Although the GTrack format specification document is 
quite large, the format is still quite simple to handle. It is based on fixed 
columns (not attributes like GFF), but allows custom columns (unlike BED). If 
you already have scripts creating output in a common tabular format, they 
should require little change to support GTrack.

3. Customizability. GTrack allows any number of custom columns to be added, in 
any order. Also, GTrack supports a scheme for creating GTrack subtypes. A 
GTrack subtype is a particular configuration of GTrack files explicitly created 
for specific uses/tools. All GTrack subtypes can still be handled by generic 
GTrack parsers.

4. Simple to parse. We have tried to make GTrack as simple as possible to 
parse. This includes the use of header lines for defining properties of a file. 
This eases parsing by telling the parser what is coming, and it allows 
quick-and-dirty parsers to explicitly assert what they are able to handle (so 
as to abort with a clearly stated reason instead of failing silently). Also, 
the GTrack subtyping scheme allows parsers to limit their support to a subset 
of the GTrack specification, e.g. files with a fixed number and order of 
columns.

5. Advanced functionality. GTrack supports more advanced functionality such as 
networks of track elements and the option of defining the domain of a track 
(e.g. the genomic regions for which the track is defined).

6. Syntax, not semantics. As with BED or WIG, the GTrack format focuses on the 
structural elements of the data, i.e. how to represent data 
mathematically/informatically. We leave the specifics of interpretation to 
others (who can, for instance, use their definitions to create GTrack subtypes).
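
To illustrate point 4, here is a minimal sketch of a quick-and-dirty parser 
that reads the header lines first and aborts with a clearly stated reason when 
a file declares something it cannot handle. It is not taken from the 
specification. The line syntax ("##name: value" for header lines, "###" for 
the column specification line, "#" for comments), the "track type" header name 
and the set of supported types are assumptions made for the example, so please 
check them against the specification document before relying on them.

```python
import io

# Hypothetical subset of track types this parser chooses to support.
SUPPORTED_TRACK_TYPES = {"segments", "valued segments"}


class UnsupportedGTrackError(ValueError):
    """Raised when a file declares something this parser cannot handle."""


def read_gtrack(handle, supported_types=SUPPORTED_TRACK_TYPES):
    """Read headers and data rows, failing loudly on unsupported input.

    Assumed syntax (an assumption of this sketch, see the specification):
      ##name: value      header line
      ###col1<TAB>col2   column specification line
      #...               comment line
    """
    headers, columns, rows = {}, None, []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("####"):
            # Any line type not covered by this sketch is rejected explicitly.
            raise UnsupportedGTrackError("unhandled line type: %r" % line)
        if line.startswith("###"):
            columns = line[3:].split("\t")
        elif line.startswith("##"):
            name, _, value = line[2:].partition(":")
            headers[name.strip().lower()] = value.strip().lower()
        elif line.startswith("#"):
            continue  # comment line
        else:
            if not rows:
                # First data line: assert what we can handle before reading on.
                track_type = headers.get("track type")
                if track_type not in supported_types:
                    raise UnsupportedGTrackError(
                        "track type %r is not supported" % track_type)
                if columns is None:
                    raise UnsupportedGTrackError(
                        "no column specification line found")
            fields = line.split("\t")
            if len(fields) != len(columns):
                raise UnsupportedGTrackError(
                    "expected %d columns, got %d" % (len(columns), len(fields)))
            rows.append(dict(zip(columns, fields)))
    return headers, rows


sample = "##track type: segments\n###seqid\tstart\tend\nchr1\t10\t20\n"
headers, rows = read_gtrack(io.StringIO(sample))
```

With the sample above, rows holds one dictionary keyed by the declared column 
names. A file that declares a track type outside the supported set raises 
UnsupportedGTrackError before any data is read, instead of being silently 
misread.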

We have also been in contact with the Galaxy team and have received positive 
signals regarding future support for GTrack in Galaxy.

Note that tools for converting between GTrack and other formats, in addition to 
a tool to help create GTrack headers, will be available soon.

We hope you find the format interesting and welcome all kinds of 
feedback/suggestions. As the paper is in a review process, we would appreciate 
feedback within the next two weeks.

Version 1.0b2 of the GTrack specification and an illustration of the 15 track 
types are available here:

    http://hyperbrowser.uio.no/hb/static/hyperbrowser/files/gtrack/GTrack_specification.txt
    http://hyperbrowser.uio.no/hb/static/hyperbrowser/files/gtrack/track_types.pdf

For the HyperBrowser team,
Sveinung Gundersen

--
Sveinung Gundersen
PhD Student, Bioinformatics, Dept. of Tumor Biology, Inst. for Cancer Research, 
The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
E-mail: sveinung.gunder...@medisin.uio.no, Phone: +47 93 00 94 54
