I'd be up for something like that, although I have other tasking in the 
short term after I finish my parallelism work. I'd rather not have Galaxy do 
the compression/decompression, because that will not effectively utilize the 
distributed nature of many filesystems, such as Isilon, that our customers use. 
My parallelism work (second phase almost done) handles that by using a 
block-gzipped format and index files that allow the compute nodes to seek to 
the blocks they need and extract from there.
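
As a rough illustration of the block-gzipped idea (a sketch only, not the actual format or code from the parallelism work), the snippet below writes a file as a series of independent gzip members and records an index so a reader can seek straight to one block. The function names, the index layout, and the lines_per_block parameter are all assumptions made for the example.

```python
import gzip
import io


def write_blocked(lines, out, lines_per_block=4):
    """Write lines as independent gzip members.

    Returns a hypothetical index: one tuple per block of
    (compressed_offset, compressed_length, first_line, n_lines).
    """
    index = []
    offset = 0
    for start in range(0, len(lines), lines_per_block):
        chunk = lines[start:start + lines_per_block]
        # Each block is a complete gzip member, so it can be
        # decompressed without reading anything before it.
        member = gzip.compress("".join(chunk).encode("ascii"))
        out.write(member)
        index.append((offset, len(member), start, len(chunk)))
        offset += len(member)
    return index


def read_block(f, entry):
    """Seek to one block and decompress only that block."""
    offset, length, _first_line, _n_lines = entry
    f.seek(offset)
    data = gzip.decompress(f.read(length))
    return data.decode("ascii").splitlines(keepends=True)


if __name__ == "__main__":
    records = [f"line{i}\n" for i in range(10)]
    buf = io.BytesIO()
    idx = write_blocked(records, buf)
    # A compute node that only needs the second block reads just that one.
    print(read_block(buf, idx[1]))
    # The total line count comes from the index, with no decompression.
    print(sum(entry[3] for entry in idx))
```

With an index like this, each compute node touches only the bytes of its own blocks, and summary values such as the line count fall out of the index for free.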

Another thing that should probably go along with this is an enhancement to 
metadata such that it can be fed in from the outside. We upload files by 
linking to file paths, and at that point, we know everything about the files 
(index information). So there is no need to decompress a 500GB file and read 
the whole thing just to count the lines - all you have to do is ask ;-}
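
A minimal sketch of what "fed in from the outside" could look like, assuming the uploader passes along a plain dictionary of what it already knows. The count_lines function and the "lines" key are hypothetical names for this example and are not part of Galaxy's metadata API.

```python
import gzip


def count_lines(path, supplied_metadata=None):
    """Return the line count of a gzipped text file.

    If the caller already knows the count (hypothetical "lines" key),
    use it and never open the file; otherwise fall back to a full scan.
    """
    if supplied_metadata is not None and "lines" in supplied_metadata:
        return supplied_metadata["lines"]
    # Fallback: decompress and read the whole file, which is the
    # expensive step that supplied metadata is meant to avoid.
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)
```

The file is only opened when nothing was supplied, so linking a large file by path with its metadata attached costs nothing extra.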

 
John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-----Original Message-----
From: Peter Cock [mailto:p.j.a.c...@googlemail.com] 
Sent: Thursday, October 06, 2011 1:28 AM
To: Duddy, John
Cc: Greg Von Kuster; galaxy-dev@lists.bx.psu.edu; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 4:48 AM, Duddy, John <jdu...@illumina.com> wrote:
> One of the things we're facing is the sheer size of a whole human genome at
> 30x coverage. An effective way to deal with that is by compressing the FASTQ
> files. That works for BWA and our ELAND, which can directly read a
> compressed FASTQ, but other tools crash when reading compressed FASTQ
> files. One way to address that would be to introduce a new type, for
> example "CompressedFastQ", with a conversion to FASTQ defined. BWA could
> take both types as input. This would allow the best of both worlds -
> efficient storage and use by all existing tools.

We'd discussed this and a more general approach where any file
could be gzipped, but the code to do that doesn't exist yet:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-September/006745.html

Issue filed:
https://bitbucket.org/galaxy/galaxy-central/issue/666/

That seems a better long term solution than the pragmatic short term
solution of fastqsanger-gzip (or whatever it gets called). Note that it
sounded like Edward Kirton might already be using this - you should
be consistent.

The other strong idea from that thread was moving from FASTQ to
unaligned BAM, which is gzip compressed, and has explicit
support for paired end reads, read groups, etc.

Peter

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/
