Re: [galaxy-dev] Tool shed and datatypes

Duddy, John Thu, 06 Oct 2011 09:45:25 -0700

Using GZIP files is definitely our plan. I just finished testing the code that
distributes the processing of a FASTQ file (or a pair of files for paired-end
data) across an arbitrary number of tasks, where each subtask extracts just the
data it needs without reading any part of the file it does not need. It builds
a standalone GZIP file just by copying whole blocks of GZIPped data and
appending them (if the window is not aligned perfectly with block boundaries,
there is additional processing). Since the entire file does not need to be
read, it distributes quite nicely.
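
As a rough illustration of the block-copy idea (not the code going into the pull request), here is a minimal Python sketch. It assumes each block is written as an independent gzip member and that an index of byte offsets is kept alongside the file. The function names and the index layout are hypothetical.

```python
import gzip
import io


def write_blocked_gzip(chunks):
    """Compress each chunk as its own gzip member.

    Returns the concatenated bytes plus an index of
    (start_offset, compressed_length) tuples, one per member.
    The index layout is an assumption made for this sketch.
    """
    out = io.BytesIO()
    index = []
    for chunk in chunks:
        member = gzip.compress(chunk)
        index.append((out.tell(), len(member)))
        out.write(member)
    return out.getvalue(), index


def extract_members(data, index, first, last):
    """Copy members first..last (inclusive) byte for byte.

    Nothing is decompressed or recompressed; the result is
    itself a valid gzip file.
    """
    start = index[first][0]
    end = index[last][0] + index[last][1]
    return data[start:end]


chunks = [
    b"@r1\nACGT\n+\nIIII\n",
    b"@r2\nTTTT\n+\nIIII\n",
    b"@r3\nGGGG\n+\nIIII\n",
]
data, index = write_blocked_gzip(chunks)

# The concatenation decompresses as if it were one big gzip stream.
assert gzip.decompress(data) == b"".join(chunks)

# A subset made by copying whole members is a standalone gzip file.
subset = extract_members(data, index, 1, 2)
assert gzip.decompress(subset) == chunks[1] + chunks[2]
```

A subtask only needs to read the byte range covered by its own members, which is why the whole file never has to be read. The case where the window does not line up with member boundaries is not handled here.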

I'll be preparing a pull request for it soon.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: [email protected]

-----Original Message-----
From: Peter Cock [mailto:[email protected]] 
Sent: Thursday, October 06, 2011 9:19 AM
To: Duddy, John
Cc: Greg Von Kuster; [email protected]; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 5:02 PM, Duddy, John <[email protected]> wrote:
> As I understand it, Isilon is built up from "bricks" that have storage
> and compute power. They replicate files amongst themselves, so
> that for every IO request there are multiple systems that could
> respond. They are interconnected by an ultra fast fibre backbone.

So why not use gzipped files on top of that? Smaller chunks of
data to access so should be faster even with the decompression
once it gets to the CPU.

> So, depending on your topology, it's possible to get a lot more
> throughput by working on different sections of the same file from
> different physical computers.

Nice.

> I haven't delved into BGZF, so I can't comment. My approach to
> block GZIP was just to concatenate multiple GZIP files and keep
> a record of the offsets and sequences contained in each. The
> advantage is compatibility, in that it decompresses just like it
> was one big chunk, yet you can compose subsets of the data
> without decompressing/recompressing and (as long as we
> actually have to write out the file subsets) can reap the reduced
> IO benefits of smaller writes.

That sounds VERY similar to BGZF - have a read over the
SAM specification which covers this. Basically they stick
the block size into the gzip headers, and the BAM index files
(BAI) use a 64-bit offset which is split into the BGZF block
offset and the offset within that decompressed block. See:
http://samtools.sourceforge.net/SAM1.pdf
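
To make the offset split concrete, here is a small Python sketch of how I read the spec: the upper 48 bits hold the file offset of the start of the BGZF block, and the lower 16 bits hold the offset within that block's decompressed data. The function names are just illustrative, so check the details against the PDF.

```python
def make_virtual_offset(block_start, within_block):
    """Pack a BGZF block start and an in-block offset into 64 bits."""
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    if not 0 <= within_block < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    return (block_start << 16) | within_block


def split_virtual_offset(virtual_offset):
    """Recover (block_start, within_block) from a virtual offset."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


voffset = make_virtual_offset(100000, 10)
assert split_virtual_offset(voffset) == (100000, 10)
```

To seek, you jump to the block start in the compressed file, decompress just that one block, and then skip ahead by the within-block offset.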

Peter

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/
