On Tue, Aug 2, 2011 at 3:12 PM, Andrew Straw <andrew.st...@imp.ac.at> wrote:
> ...
>
> My proposal for a general solution, and what I'd be interested in
> feedback on, is an idea of a "dataset container" (this is just a working
> name). It would look and act much like a dataset in the history, but
> would in fact be a logical construct that merely bundles together a
> homogeneous bunch of datasets. When a tool (or a workflow) is applied to
> a dataset container, Galaxy would automatically create a new container
> in which each dataset in this new container is the result of running the
> tool. (Workflows with N steps would thus generate N new containers.) The
> thing I like about this idea is that it preserves the ability to use
> tools and workflows on both individual datasets and, with some
> additional logic, on these new containers. In particular, I don't think
> the tools and workflows themselves would have to be modified. This would
> seemingly mitigate the slow JavaScript issue by only showing a few items
> in the history window (even though Galaxy may have launched many jobs in
> the background). Furthermore, a new Reduce tool type could then act to
> take a dataset container as input and output a single dataset.
>
> ...

That is a very interesting idea.

Note that in some of the use cases I had in mind the order of the
sub-files was important, but in other cases not. So I think that
internally Galaxy would have to store a "dataset collection"
(aka "homogeneous filetype collection") as an ordered list of
filenames.
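
To make that concrete, here is a rough sketch in Python (the names
are hypothetical, not actual Galaxy classes):

    # Hypothetical sketch only - not real Galaxy code. A "dataset
    # collection" is just an ordered list of member files sharing
    # one datatype.
    class DatasetCollection:
        def __init__(self, datatype, filenames):
            self.datatype = datatype          # e.g. "blastxml", "fasta"
            self.filenames = list(filenames)  # original order preserved

        def __iter__(self):
            # Iterate in the original order, which matters for the
            # use cases where sub-file order is significant.
            return iter(self.filenames)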

As you observed, at the level of an individual tool, nothing
changes - it is given a single input file as before, but
now multiple copies of the tool will be running, each with a
different input file (or set of files for more complex tools).
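
In rough Python, reusing the hypothetical DatasetCollection sketch
above (with run_tool standing in for however Galaxy actually
dispatches jobs - in practice these would be parallel cluster jobs,
not a sequential loop):

    # Hypothetical sketch of the "map" step: apply an unmodified
    # tool to each member of an input collection, producing an
    # output collection in matching order.
    def map_tool_over_collection(run_tool, collection):
        outputs = []
        for input_filename in collection:
            # Each invocation sees a single input file, exactly as
            # if run on an individual dataset in the history.
            outputs.append(run_tool(input_filename))
        # The real output datatype would come from the tool's
        # definition; reusing the input datatype keeps this simple.
        return DatasetCollection(collection.datatype, outputs)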

I had been mulling over what is essentially a special case of
this - a new datatype for "collection of BLAST XML files" - and
debating with myself whether a zip file or simple concatenation
would work here. In the case of BLAST XML files, there is
precedent from early NCBI BLAST tools, which output concatenated
XML files (which are not valid XML).

My motivating example was the embarrassingly parallel task
of multi-query BLAST searches. Here we can split up the input
query file (*) and run the searches separately (the map step).
The potentially hard part is merging the output (the reduce).
Tabular output and plain text can basically be concatenated
(note we should preserve the original query order). For XML
(or -shudder- HTML output), a bit of data munging is needed.
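
For the BLAST XML case, the munging could be as simple as keeping
the header/footer from the first file and splicing in the
<Iteration> blocks from the rest. A rough sketch (naive string
handling, assumes each file contains at least one <Iteration>):

    # Hypothetical sketch: merge per-query-chunk NCBI BLAST XML
    # outputs into a single valid XML file. The filenames must be
    # given in the original query order.
    def merge_blast_xml(filenames, out_filename):
        footer = ""
        with open(out_filename, "w") as out:
            for i, name in enumerate(filenames):
                with open(name) as handle:
                    text = handle.read()
                head, _, rest = text.partition("<Iteration>")
                body, _, tail = rest.rpartition("</Iteration>")
                if i == 0:
                    # XML declaration down to <BlastOutput_iterations>
                    out.write(head)
                    # closing </BlastOutput_iterations></BlastOutput>
                    footer = tail
                out.write("<Iteration>%s</Iteration>" % body)
            out.write(footer)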

Your idea is much more elegant, and to me fits nicely with
a general sub-task parallelization framework (as well as your
example of running a single workflow on a collection of data
files).

Peter

(*) You can also split the BLAST database/subject file, and
there are options to adjust the e-value significance accordingly
(so it is calculated using the full database size, not the
partial database size). The downside is that merging the results
is much more complicated.
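
With BLAST+ the relevant option is -dbsize (the effective database
length). A rough sketch of a per-chunk search, where full_db_letters
(a name I am making up here) would be the residue/base count of the
unsplit database:

    import subprocess

    # Hypothetical sketch: search one database chunk, but have the
    # e-values computed as if against the full database by passing
    # the BLAST+ -dbsize option.
    def blast_chunk(query, db_chunk, out, full_db_letters):
        subprocess.check_call([
            "blastp",
            "-query", query,
            "-db", db_chunk,
            "-out", out,
            "-outfmt", "6",                   # tabular output
            "-dbsize", str(full_db_letters),  # full, not partial, size
        ])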