On Thu, Sep 1, 2011 at 11:02 PM, Edward Kirton <eskir...@lbl.gov> wrote:
> Read QC intermediate files account for most of the storage used on our
> Galaxy site. And it's a real problem that I must solve soon.
> My first attempt at taming the beast was to try to create a single read QC
> tool that did such things as convert qual encoding, qual-end trimming, etc.
> (very basic functions).  Such a tool could simply be a wrapper around your
> favorite existing tools, but doesn't keep the intermediate files.  The added
> benefit is that it runs faster because it only has to queue onto the cluster
> once.
> Sure, one might argue that it's nice to have all the intermediate files just
> in case you wish to review them, but in practice, I have found this happens
> relatively infrequently and is too expensive.  If you're a small lab maybe
> that's fine, but if you generate a lot of sequence, a more production-line
> approach is reasonable.

Sounds very sensible if you have some frequently repeated multistep
analyses.
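
As a rough sketch of the combined approach (not Edward's actual tool), the quality re-encoding and 3' end trimming can be chained in one streaming pass so nothing is written between steps. The function names, the Phred+64 input assumption, and the default threshold of 20 are all illustrative choices, and the parser assumes plain four-line FASTQ records:

```python
def illumina_to_sanger(qual):
    """Convert a Phred+64 quality string to Phred+33 (Sanger)."""
    return "".join(chr(ord(c) - 31) for c in qual)


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose quality score is below min_q."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def read_fastq(handle):
    """Yield (title, seq, qual) from a four-line-per-record FASTQ handle."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield title[1:], seq, qual


def qc_pipeline(in_handle, out_handle, min_q=20, min_len=1):
    """Re-encode and trim each read in turn; only the final reads are written."""
    kept = 0
    for title, seq, qual in read_fastq(in_handle):
        qual = illumina_to_sanger(qual)
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
            kept += 1
    return kept
```

Each read passes through both steps in memory, so only the final output ever reaches disk, and the whole thing is one cluster job.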

> I've been toying with the idea of replacing all the fastq datatypes with a
> single fastq datatype that is sanger-encoded and gzipped.  I think gzipped
> reads files are about 1/4 of the unpacked version.  Of course, many tools
> will require a wrapper if they don't accept gzipped input, but that's
> trivial (and many already support compressed reads).
> However, the import tool automatically uncompresses uploaded files, so I'd
> need to do some hacking there to prevent this.

Hmm. Probably there are some tasks where a gzip'd FASTQ isn't
ideal, but for the fairly typical case of iterating over the records
it should be fine.
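
For illustration, iterating over a gzip'd FASTQ needs nothing beyond the Python standard library. This is only a minimal sketch: the function name is made up, and it assumes four lines per record with no line wrapping. Random access is the sort of task where it falls down, since a plain gzip stream has to be read from the start.

```python
import gzip


def iter_fastq_gz(path):
    """Yield (title, seq, qual) from a gzip-compressed four-line FASTQ file."""
    with gzip.open(path, "rt") as handle:  # "rt" decompresses to text on the fly
        while True:
            title = handle.readline()
            if not title:
                break  # end of file
            seq = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            qual = handle.readline().rstrip("\n")
            yield title.rstrip("\n")[1:], seq, qual
```

The file is decompressed as it is read, so memory use stays flat regardless of file size.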

> Heck, what we really need is a nice compact binary format for reads, perhaps
> which doesn't even store ids (although pairing would need to be recorded).
> Thoughts?

What, like a BAM file of unaligned reads? Uses gzip compression, and
tracks the pairing information explicitly :) Some tools will already take
this as an input format, but not all.
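
To make the pairing point concrete, here is a small sketch of how an unaligned read looks as a SAM text line, using the standard SAM FLAG bits. The helper name and its `mate` argument are hypothetical, and producing an actual BAM file would still need a converter such as samtools:

```python
# Standard SAM FLAG bits relevant to unaligned reads.
PAIRED, UNMAPPED, MATE_UNMAPPED, READ1, READ2 = 0x1, 0x4, 0x8, 0x40, 0x80


def unaligned_sam_line(name, seq, qual, mate=None):
    """Format one unaligned read as a SAM line; mate is 1, 2, or None (single)."""
    flag = UNMAPPED
    if mate in (1, 2):
        flag |= PAIRED | MATE_UNMAPPED | (READ1 if mate == 1 else READ2)
    # QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
    fields = [name, str(flag), "*", "0", "0", "*", "*", "0", "0", seq, qual]
    return "\t".join(fields)
```

Both mates share the same read name, and the FLAG field records which is first and which is second, so no /1 and /2 suffixes are needed.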

Peter
