Re: [galaxy-dev] Tool shed and datatypes

Duddy, John Thu, 06 Oct 2011 09:02:58 -0700

As I understand it, Isilon is built up from "bricks" that have storage and 
compute power. They replicate files amongst themselves, so that for every IO 
request there are multiple systems that could respond. They are interconnected 
by an ultra fast fibre backbone.

So, depending on your topology, it's possible to get a lot more throughput by 
working on different sections of the same file from different physical 
computers.

I haven't delved into BGZF, so I can't comment. My approach to block GZIP was 
just to concatenate multiple GZIP files and keep a record of the offsets and 
sequences contained in each. The advantage is compatibility, in that it 
decompresses just like it was one big chunk, yet you can compose subsets of the 
data without decompressing/recompressing and (as long as we actually have to 
write out the file subsets) can reap the reduced IO benefits of smaller writes.
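
To make the concatenation idea concrete, here is a minimal sketch in Python. It is not the actual implementation described above. The function names, the (offset, length) index layout, and the sample FASTQ-like records are all assumptions made for illustration. It relies only on the fact that concatenated gzip members form a valid gzip stream.

```python
import gzip
import io


def write_blocked_gzip(blocks):
    """Compress each block as its own gzip member and concatenate them.

    Returns (data, index), where index holds one (offset, length) byte
    range per member. The index layout is hypothetical.
    """
    out = io.BytesIO()
    index = []
    for block in blocks:
        member = gzip.compress(block)
        index.append((out.tell(), len(member)))
        out.write(member)
    return out.getvalue(), index


def compose_subset(data, index, wanted):
    """Build a valid gzip stream from selected members by copying bytes.

    Nothing is decompressed or recompressed here.
    """
    ranges = (index[i] for i in wanted)
    return b"".join(data[off:off + length] for off, length in ranges)


# Made-up sample records, one per block.
blocks = [
    b"@read1\nACGT\n+\nIIII\n",
    b"@read2\nTTGA\n+\nIIII\n",
    b"@read3\nGGCC\n+\nIIII\n",
]
data, index = write_blocked_gzip(blocks)

# The whole file decompresses as if it were one big chunk.
whole = gzip.decompress(data)

# A worker can seek to one member and decompress only that block.
off, length = index[1]
middle = gzip.decompress(data[off:off + length])

# A subset is composed by byte copying and is itself a valid gzip stream.
subset = gzip.decompress(compose_subset(data, index, [0, 2]))
```

A real index would presumably also record which sequences live in each member, so that a compute node can pick the blocks it needs without reading the data.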

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: [email protected]

-----Original Message-----
From: Peter Cock [mailto:[email protected]] 
Sent: Thursday, October 06, 2011 8:16 AM
To: Duddy, John
Cc: Greg Von Kuster; [email protected]; Nate Coraor
Subject: Re: [galaxy-dev] Tool shed and datatypes

On Thu, Oct 6, 2011 at 3:48 PM, Duddy, John <[email protected]> wrote:
> I'd be up for something like that, although I have other tasking
> in the short term after I finish my parallelism work. I'd rather not have
> Galaxy do the compression/decompression, because that will not
> effectively utilize the distributed nature of many filesystems, such
> as Isilon, that our customers use.

Is that like a compressed filesystem, where there is probably less
benefit to storing the data gzipped?

> My parallelism work (second
> phase almost done) handles that by using a block-gzipped
> format and index files that allow the compute nodes to seek to
> the blocks they need and extract from there.

How similar is your block-gzipped approach to BGZF used in BAM?

> Another thing that should probably go along with this is an
> enhancement to metadata such that it can be fed in from the
> outside. We upload files by linking to file paths, and at that
> point, we know everything about the files (index information).
> So no need to decompress a 500GB file and read the whole
> thing just to count the lines - all you have to do is ask ;-}

I can see how that might be useful.

Peter
