> -----Original Message-----
> From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-
> boun...@lists.bx.psu.edu] On Behalf Of Peter Cock
> Sent: Friday, 22 November 2013 3:51 a.m.
> To: Nate Coraor
> Cc: galaxy-dev@lists.bx.psu.edu
> Subject: Re: [galaxy-dev] Compressed data files in Galaxy? (e.g. GZIP or
> BGZF)
> 
> On Thu, Nov 21, 2013 at 2:27 PM, Peter Cock <p.j.a.c...@googlemail.com>
> wrote:
> > On Wed, May 16, 2012 at 9:39 PM, Nate Coraor <n...@bx.psu.edu> wrote:
> >> On May 16, 2012, at 9:47 AM, Peter Cock wrote:
> >>
> >>> Hello all,
> >>>
> >>> What is the current status in Galaxy for supporting compressed files?
> >>
> >> Hi Peter,
> >>
> >> Unfortunately, there's been nothing done on this so far.  I'd love to
> >> see it happen, but it hasn't been on the top of our priorities.
> >>
> >> --nate
> >
> > Hi Nate,
> >
> > Where does this sit on the Galaxy team's priorities now, 18 months on?
> > I think I asked about this at the GCC2013, and it was seen as
> > important but not yet at the top of the priorities list.
> 
> Nate's reply on Twitter explained that the public Galaxy Instance (formerly
> hosted at Penn State, now in Texas) uses transparent compression at the file
> system level with ZFS - so Galaxy doesn't need to compress individual files.
> Neat:
> https://twitter.com/natefoo/status/403531922514522112
> 
> Peter

Our approach has been to integrate compress / un-compress with the
job splitter / cluster launch layer that we've been working on (current
incarnation is https://bitbucket.org/agr-bifo/tardis ).

The tardis approach is to sniff the input(s) and, if necessary, insert an
uncompressor into the input stream. The original data is left compressed
in place, and only the splits of the data are uncompressed. Each uncompressed
data chunk is launched for processing on the cluster as soon as it becomes
available from the uncompressed input stream. This can be quite a big
performance advantage compared with uncompressing the entire input file
before splitting it, which can be quite slow and has a bigger disk footprint.
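
As a rough illustration of the sniff-and-stream idea described above (this is
not tardis's actual code; the function names open_maybe_compressed and
iter_chunks are made up for this sketch, and it assumes line-oriented text
input with gzip/BGZF detected by the two-byte gzip magic number):

```python
import gzip
from itertools import islice

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip (and BGZF) file


def open_maybe_compressed(path):
    """Sniff the first two bytes and return a text stream.

    If the file looks like gzip, it is decompressed on the fly;
    the file on disk is left compressed.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")


def iter_chunks(path, lines_per_chunk):
    """Lazily yield lists of at most lines_per_chunk lines.

    Each chunk is available as soon as it has been read from the
    (possibly decompressing) stream, so a caller can launch work on
    it before the rest of the input has been uncompressed.
    """
    with open_maybe_compressed(path) as stream:
        while True:
            chunk = list(islice(stream, lines_per_chunk))
            if not chunk:
                break
            yield chunk
```

A caller would loop over iter_chunks(path, n), write each chunk to a scratch
file and submit it to the cluster straight away. For record-based formats such
as FASTQ, n would need to be a multiple of the record length (4 lines) so that
records are not split across chunks.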

We find that job splitting and compression handling (and potentially other
low-level data file transforms, e.g. handling list files or random sampling
of input) are interdependent. Potentially they are best encapsulated in a
lower layer, which would avoid cluttering Galaxy's high-level bio-oriented
data type ontologies?

Currently (as per my earlier post) we are integrating this approach into
Galaxy by modifying selected tool config files. Longer term, we think it could
be possible to slot this low-level data transform and task-splitting layer
into the core Galaxy stack.

John Chilton kindly set up a trello card on the general topic
of task splitting - this is at 

https://trello.com/c/H87LotF7

(Apologies if this repeats some of my earlier post - the point here is that,
in practice, we find that e.g. compress/un-compress and task splitting are
interdependent.)

(Having said the above - encapsulating data transforms such as
compress/un-compress even lower down the stack, in the file system itself, as
per the public Galaxy instance (ZFS), could be pretty hard to beat!)

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
