Not intending to hijack the thread, but in response to John's comment
-- I, too, made a general solution for embarrassingly parallel problems,
but instead of splitting the large files on disk, I just use seek to
move the file pointer so each task can grab its part.
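
A minimal sketch of the seek-based idea, assuming line-oriented input
where any line start is a valid place to cut. The names chunk_offsets
and read_chunk are hypothetical and are not taken from Galaxy or from
any existing tool. Multi-line records (e.g. 4-line FASTQ entries) would
need a boundary check that understands the record format.

```python
import os


def chunk_offsets(path, n_tasks):
    """Return (start, end) byte ranges, one per task, cut at line starts.

    Hypothetical helper: assumes every line start is a valid record boundary.
    """
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as fh:
        for i in range(1, n_tasks):
            # Jump to the approximate split point, then advance to the
            # start of the next line so no line is cut in half.
            fh.seek(size * i // n_tasks)
            fh.readline()
            bounds.append(fh.tell())
    bounds.append(size)
    # Drop empty ranges (possible when lines are long relative to a chunk).
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]


def read_chunk(path, start, end):
    """Read only the bytes in [start, end) without touching the rest."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)
```

Each task only needs the path plus its (start, end) pair and opens the
file itself, so nothing is rewritten on disk before the jobs start.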

On Tue, Aug 2, 2011 at 10:54 AM, Duddy, John <> wrote:
> I did something similar, but implemented as an evolution of the original 
> "basic" parallelism (see BWA), that:
> - Moved the splitting of input files into the datatype classes
> - Allowed any number of inputs to be split, as long as they were the same 
> datatype (so they were mutually consistent - think paired end fastq files)
> - Allowed other inputs to be shared among jobs
> - Merged any number of outputs, with merge code implemented in the datatype
> classes
> This worked functionally, but the IO required to split large files has proved 
> too much for something like a whole genome (~500GB).
> I was thinking of something philosophically similar to your dataset container 
> idea, but more in the idea that a dataset is no longer a "file", so the jobs 
> running on subsets of the dataset would just ask for the parts they need. 
> Galaxy would take care of preserving the abstraction that the subset of the 
> dataset is a single input file, perhaps by extracting the subset to a 
> temporary file on local storage. Similarly, the merged outputs would just be 
> held in the target dataset, not copied, thus making the IO cost for the 
> "merge" 0 for the simple case where it is mere concatenation.
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail:
> -----Original Message-----
> From: 
> [] On Behalf Of Andrew Straw
> Sent: Tuesday, August 02, 2011 7:13 AM
> To:
> Subject: [galaxy-dev] using Galaxy for map/reduce
> Hi all,
> I've been investigating use of Galaxy for our lab and it has many
> attractive aspects -- a big thank you to all involved.
> We still have a couple of related sticking points, however, that I would
> like to get the Galaxy developers' feedback on. Basically, I want to use
> Galaxy to run Map/Reduce type analysis on many initial data files. What
> I mean is that I want to take many initial datasets (e.g. 250 or more),
> perhaps already stored in a library, and then apply a workflow to each
> and every one of them (the Map step). Then, on the many result datasets
> (one from each of the initial datasets), I want to run a Reduce step
> which creates a single dataset. I have achieved this in an imperfect and
> not-quite-working way with a few tricks, but I hope that with a little
> work, Galaxy could be much better for this type of use case.
> I have a couple of specific problems and a proposal for a general solution:
> 1) My first specific problem is that loading many datasets (e.g. 250)
> into history causes the Javascript running locally within a browser to
> be extremely slow.
> 2) My second specific problem is that applying a workflow with N steps
> to many datasets creates even more datasets (Nx250 additional datasets).
> In addition to the slow Javascript problem, there seem to be other
> issues I haven't diagnosed further, but the console in which I'm running
> indicates many errors of the type "Exception AssertionError:
> AssertionError('State <sqlalchemy.orm.state.MutableAttrInstanceState
> object at 0x7f5c18c47990> is not present in this identity map',) in
> <bound method MutableAttrInstanceState._cleanup of
> <sqlalchemy.orm.state.MutableAttrInstanceState object at
> 0x7f5c18c47990>> ignored". Furthermore the webserver gets slow and my
> nginx frontend proxy gives 504 gateway time-outs.
> 3) There's no good way to do reduce within Galaxy. Currently I work
> around this by having a tool type which takes as an input a dataset and
> then uploads this to a self-written webserver, which then collects such
> uploads, performs the reduce, and offers a download link for the user to
> collect the reduced dataset. The user must manually then upload this
> dataset back into Galaxy for further processing.
> My proposal for a general solution, and what I'd be interested in
> feedback on, is an idea of a "dataset container" (this is just a working
> name). It would look and act much like a dataset in the history, but
> would in fact be a logical construct that merely bundles together a
> homogeneous bunch of datasets. When a tool (or a workflow) is applied to
> a dataset container, Galaxy would automatically create a new container
> in which each dataset in this new container is the result of running the
> tool. (Workflows with N steps would thus generate N new containers.) The
> thing I like about this idea is that it preserves the ability to use
> tools and workflows on both individual datasets and, with some
> additional logic, on these new containers. In particular, I don't think
> the tools and workflows themselves would have to be modified. This would
> seemingly mitigate the slow Javascript issue by only showing a few items
> in the history window (even though Galaxy may have launched many jobs in
> the background). Furthermore, a new Reduce tool type could then act to
> take a dataset container as input and output a single dataset.
> A library doesn't seem a good candidate for the dataset container idea I
> have above. I realize that a library also bundles together datasets, but
> it has other attributes that don't play well with the above idea (the
> idea of hierarchically arranged folders and heterogeneous datasets) nor
> can it be represented in the history.
> I'm interested in thoughts on this proposal, as I think it would really
> help us, and I think our use case may be representative of what others
> might also like to do. I realize that in my text above I write "with
> some additional logic" to describe the work required to implement this
> idea, but the fact is that I have very little idea about how much work
> this would be. So, practically speaking, my question boils down to how
> hard would implementing this be, given the existing code base and goals?
> And, would such an implementation - if done to the taste of the Galaxy
> devs, of course - have a chance of making it into the Galaxy distribution?
> Thanks,
> Andrew
> --
> Andrew D. Straw, Ph.D.
> Research Institute of Molecular Pathology (IMP)
> Vienna, Austria
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at: