Re: [galaxy-dev] using Galaxy for map/reduce

Ravi Madduri Tue, 02 Aug 2011 08:25:10 -0700

Hi,
I really like this proposal. We faced some of the issues you describe below 
when we tried to use Galaxy with high-throughput computing techniques 
(using Condor) for sequencing close to 500 genomes (an embarrassingly parallel 
problem). We leveraged (hacked) the dataset construct, but it did not map very 
well onto the problem we were trying to solve. We ended up using the Galaxy 
tool mechanism to create a "Composite Dataset" from a filesystem location.  
This approach required a configuration file to be updated with a directory 
path containing the datasets.  The directory listing was then filtered and 
displayed to the user of the tool to allow them to select a genome.  The tool 
would then create a composite dataset consisting of a JSON document containing 
at least a list of all the files in the CG data directory. 
We are not sure how generally useful this tool would be.
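
For illustration only, here is a minimal sketch of the steps described above: 
list the candidate genome directories under a configured path, then write a 
JSON document listing every file for the selected one. This is not the actual 
tool code. The function names, the one-sub-directory-per-genome layout, and the 
JSON keys ("genome", "files") are all assumptions made for the example.

```python
import json
import os


def list_genomes(data_root):
    """Return the sub-directory names under data_root, sorted.

    Assumption: each genome lives in its own sub-directory.
    """
    return sorted(
        name
        for name in os.listdir(data_root)
        if os.path.isdir(os.path.join(data_root, name))
    )


def build_manifest(data_root, genome):
    """Collect every file below the chosen genome directory.

    Paths are stored relative to the genome directory. The key names
    are hypothetical, not a Galaxy format.
    """
    genome_dir = os.path.join(data_root, genome)
    files = []
    for dirpath, _dirnames, filenames in os.walk(genome_dir):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            files.append(os.path.relpath(full_path, genome_dir))
    return {"genome": genome, "files": sorted(files)}


def write_manifest(manifest, out_path):
    """Write the manifest as the JSON document of the composite dataset."""
    with open(out_path, "w") as handle:
        json.dump(manifest, handle, indent=2)
```

In a tool wrapper, the result of list_genomes() would populate the selection 
shown to the user, and write_manifest() would write to the output path that 
Galaxy hands to the tool.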

On Aug 2, 2011, at 10:12 AM, Andrew Straw wrote:

> Hi all,
> 
> I've been investigating use of Galaxy for our lab and it has many
> attractive aspects -- a big thank you to all involved.
> 
> We still have a couple of related sticking points, however, that I would
> like to get the Galaxy developers' feedback on. Basically, I want to use
> Galaxy to run Map/Reduce type analysis on many initial data files. What
> I mean is that I want to take many initial datasets (e.g. 250 or more),
> perhaps already stored in a library, and then apply a workflow to each
> and every one of them (the Map step). Then, on the many result datasets
> (one from each of the initial datasets), I want to run a Reduce step
> which creates a single dataset. I have achieved this in an imperfect and
> not-quite-working way with a few tricks, but I hope that with a little
> work, Galaxy could be much better for this type of use case.
> 
> I have a couple of specific problems and a proposal for a general solution:
> 
> 1) My first specific problem is that loading many datasets (e.g. 250)
> into history causes the JavaScript running locally within a browser to
> be extremely slow.
> 
> 2) My second specific problem is that applying a workflow with N steps
> to many datasets creates even more datasets (Nx250 additional datasets).
> In addition to the slow JavaScript problem, there seem to be other
> issues I haven't diagnosed further, but the console in which I'm running
> run.sh indicates many errors of the type "Exception AssertionError:
> AssertionError('State <sqlalchemy.orm.state.MutableAttrInstanceState
> object at 0x7f5c18c47990> is not present in this identity map',) in
> <bound method MutableAttrInstanceState._cleanup of
> <sqlalchemy.orm.state.MutableAttrInstanceState object at
> 0x7f5c18c47990>> ignored". Furthermore the webserver gets slow and my
> nginx frontend proxy gives 504 gateway time-outs.
> 
> 3) There's no good way to do reduce within Galaxy. Currently I work
> around this by having a tool type which takes as an input a dataset and
> then uploads this to a self-written webserver, which then collects such
> uploads, performs the reduce, and offers a download link for the user to
> collect the reduced dataset. The user must manually then upload this
> dataset back into Galaxy for further processing.
> 
> My proposal for a general solution, and what I'd be interested in
> feedback on, is an idea of a "dataset container" (this is just a working
> name). It would look and act much like a dataset in the history, but
> would in fact be a logical construct that merely bundles together a
> homogeneous bunch of datasets. When a tool (or a workflow) is applied to
> a dataset container, Galaxy would automatically create a new container
> in which each dataset in this new container is the result of running the
> tool. (Workflows with N steps would thus generate N new containers.) The
> thing I like about this idea is that it preserves the ability to use
> tools and workflows on both individual datasets and, with some
> additional logic, on these new containers. In particular, I don't think
> the tools and workflows themselves would have to be modified. This would
> seemingly mitigate the slow Javascript issue by only showing a few items
> in the history window (even though Galaxy may have launched many jobs in
> the background). Furthermore, a new Reduce tool type could then act to
> take a dataset container as input and output a single dataset.
> 
> A library doesn't seem a good candidate for the dataset container idea I
> have above. I realize that a library also bundles together datasets, but
> it has other attributes that don't play well with the above idea (the
> idea of hierarchically arranged folders and heterogeneous datasets) nor
> can it be represented in the history.
> 
> I'm interested in thoughts on this proposal, as I think it would really
> help us, and I think our use case may be representative of what others
> might also like to do. I realize that in my text above I write "with
> some additional logic" to describe the work required to implement this
> idea, but the fact is that I have very little idea about how much work
> this would be. So, practically speaking, my question boils down to: how
> hard would implementing this be, given the existing code base and goals?
> And, would such an implementation - if done to the taste of the Galaxy
> devs, of course - have a chance of making it into the Galaxy distribution?
> 
> Thanks,
> Andrew
> 
> -- 
> Andrew D. Straw, Ph.D.
> Research Institute of Molecular Pathology (IMP)
> Vienna, Austria
> http://strawlab.org/
> 
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
> 
>  http://lists.bx.psu.edu/

--
Ravi K Madduri
The Globus Alliance | Argonne National Laboratory | University of Chicago
http://www.mcs.anl.gov/~madduri

