On Mon, Oct 28, 2013 at 9:39 PM, McCulloch, Alan
<alan.mccull...@agresearch.co.nz> wrote:
> dear all,
>
>
>
> There have been a few posts lately about doing distributed computing via
> Galaxy, i.e. job splitters etc. Below is a contribution of some ideas we
> have developed and applied in our work, where we have arranged for some
> Galaxy tools to execute in parallel on our cluster.
>
>
>
> We have developed a job-splitter script "tardis.py" (available from
> https://bitbucket.org/agr-bifo/tardis), which takes marked-up standard unix
> commands that run an application or tool. The mark-up is prefixed to the
> input and output command-line options. Tardis strips off the mark-up and
> rewrites the commands to refer to split inputs and outputs, which are then
> executed in parallel, e.g. on a distributed compute resource. Tardis knows
> the output files to expect and how to join them back together.
>
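To make the split / rewrite / run / join cycle described above concrete, here is a minimal Python sketch of the general pattern. It is not tardis code and does not use tardis mark-up: the `{in}` and `{out}` placeholders, the function names and the fixed line-count chunking are all assumptions made for illustration, and chunks run as local sub-processes rather than as cluster jobs.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_lines(src, chunk_size, workdir):
    """Split a line-oriented file into chunks of at most chunk_size lines."""
    with src.open() as handle:
        lines = handle.readlines()
    chunks = []
    for start in range(0, len(lines), chunk_size):
        chunk = workdir / f"{src.name}.{len(chunks):05d}"
        chunk.write_text("".join(lines[start:start + chunk_size]))
        chunks.append(chunk)
    return chunks


def run_chunk(command, chunk):
    """Rewrite the command to point at one chunk, then run it."""
    out = chunk.with_name(chunk.name + ".out")
    # "{in}" and "{out}" are hypothetical placeholders, not tardis mark-up.
    rewritten = [
        arg.replace("{in}", str(chunk)).replace("{out}", str(out))
        for arg in command
    ]
    subprocess.run(rewritten, check=True)
    return out


def run_split(command, src, dest, chunk_size, workers=4):
    """Split src, run command on every chunk in parallel, join into dest."""
    with tempfile.TemporaryDirectory() as tmp:
        chunks = split_lines(Path(src), chunk_size, Path(tmp))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map keeps chunk order, so joining is plain concatenation.
            outputs = list(pool.map(lambda c: run_chunk(command, c), chunks))
        with Path(dest).open("w") as joined:
            for out in outputs:
                joined.write(out.read_text())
```

Plain concatenation is only a valid join for outputs that are record-per-line with no header; other formats would need a format-aware join step.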
>
>
> (This was referred to in our GCC2013 talk:
> http://wiki.galaxyproject.org/Events/GCC2013/Abstracts#Events.2FGCC2013.2FAbstracts.2FTalks.A_layered_genotyping-by-sequencing_pipeline_using_Galaxy)
>
>
>
> Any reasonable unix-based data processing or analysis command may be marked
> up and run using tardis, though of course tardis needs to know how to split
> and join the data. Our approach also assumes a “symmetrical” HPC cluster
> configuration, in the sense that each node sees the same view of the file
> system (and has the required underlying application installed). We use
> tardis to support both Galaxy and command-line based compute.
>
>
>
> Background / design pattern / motivating analogy: Galaxy provides a
> high-level "end to end" view of a workflow; the HPC cluster resource that
> one uses then involves spraying chunks of data out into parallel processes,
> usually in the form of some kind of distributed compute cluster. But an
> end-user looking at a Galaxy history should ideally not be able to tell
> whether the workflow was run as a single process on the server or via many
> parallel processes on the cluster (apart from the fact that when run in
> parallel on the cluster, it's a lot faster!). We noticed that the TCP/IP
> layered networking protocol stack provides a useful metaphor and design
> pattern, with the "end to end" topology of a Galaxy workflow corresponding
> to the transport layer of TCP/IP, and the distribution of computation across
> a cluster corresponding to the next TCP/IP layer down, the packet-routing
> layer.
>
>
>
> This picture suggested a strongly layered approach to provisioning Galaxy
> with parallelised compute on split data, and hence an approach in which the
> footprint of parallel / distributed compute support in the Galaxy code-base
> should ideally (from the layered-design point of view) be minimal and
> superficial. Thus in our approach so far, the only footprint is in the tool
> config files, where we arrange the templating to (optionally) prefix the
> required tardis mark-up to the input and output command options, and the
> tardis script name to the command as a whole. Tardis then takes care of
> rewriting and launching all of the jobs, and finally joining the results
> back together and putting them where Galaxy expects them to be (and also
> housekeeping such as collating and passing up stderr and stdout, and
> appropriate process exit codes). (For each Galaxy job, tardis creates a
> working folder in a designated scratch area, where input files are
> uncompressed and split, job files and their output are stored, logging is
> done, etc. Split data is cleaned up at the end unless there was an error in
> some part of the job, in which case everything is retained for debugging
> and, in some cases, restart.)
>
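The housekeeping described above (a per-job working folder, collated stdout and stderr, an overall exit code, and clean-up only on success) could be sketched roughly as follows. This is an illustration under assumptions, not how tardis is implemented: the function name `run_in_scratch`, the returned tuple, and the "first non-zero exit code wins" rule are all hypothetical choices, and the jobs run serially as local sub-processes.

```python
import shutil
import subprocess
import tempfile


def run_in_scratch(commands, scratch=None):
    """Run commands in a fresh working folder under a scratch area.

    Returns (exit_code, stdout, stderr, workdir). workdir is None when
    everything succeeded and the folder was removed; otherwise it is the
    path of the retained folder, kept for debugging.
    """
    workdir = tempfile.mkdtemp(prefix="job_", dir=scratch)
    results = [
        subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        for cmd in commands
    ]
    # Collate the output streams in job order.
    stdout = "".join(r.stdout for r in results)
    stderr = "".join(r.stderr for r in results)
    # Assumption: report the first non-zero exit code, or 0 if all passed.
    exit_code = next((r.returncode for r in results if r.returncode != 0), 0)
    if exit_code == 0:
        shutil.rmtree(workdir)
        workdir = None
    return exit_code, stdout, stderr, workdir
```

Returning the retained folder path gives the caller somewhere to look (or to restart from) when a chunk fails.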
>
>
> (We modify Galaxy tool-configs so that the user can optionally choose to run
> the tool on our HPC cluster: there are three HPC-related input fields
> appended to the input section of a tool. Here the user selects whether they
> want to use our cluster and, if so, specifies the chunk size. They can also
> at that point specify a sampling rate, since we often find it useful to run
> preliminary analyses on a random sample of (for example) single or
> paired-end NGS sequence data, to obtain a fairly quick snapshot of the data
> before the expense of a complete run. We found it convenient to include
> support for input sampling in tardis.)
>
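As an illustration of the input sampling idea, here is a small Python sketch that keeps each FASTQ record with a given probability. It is an assumption-laden example rather than the tardis implementation: the function name `sample_fastq`, the `seed` parameter and the strict four-lines-per-record layout are all hypothetical. Using the same seed for both files of a paired-end run keeps the same record positions in each, provided the two files list their reads in the same order.

```python
import random
from itertools import islice


def sample_fastq(lines, rate, seed=0):
    """Yield the lines of FASTQ records, keeping each with probability rate.

    Assumes every record occupies exactly four lines.
    """
    rng = random.Random(seed)
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            break
        # One random draw per record, so two files sampled with the same
        # seed keep the same record positions.
        if rng.random() < rate:
            yield from record
```

For example, `sample_fastq(open("reads_1.fastq"), 0.01, seed=42)` and the same call on `reads_2.fastq` (hypothetical file names) would give a roughly 1% sample with mates still matched up.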
>
>
> The pdf document at https://bitbucket.org/agr-bifo/tardis includes a number
> of examples of marking up a command, and also a simple example of a Galaxy
> tool-config that has been modified to include support for optionally running
> the job on our HPC cluster via the tardis pre-processor.
>
>
>
> Known limitations:
>
>
>
> * we have not yet attempted to integrate our approach with the existing
>   Galaxy job-splitting distributed compute support, partly because of our
>   “layered” design goal (admittedly also partly because of ignorance about
>   its details!)
>
> * our current implementation is quite naive in the distributed compute API
>   it uses: it supports launching condor job files (and also native
>   sub-processes). Our plan is to replace that with the drmaa API
>
> * we would like to integrate it better with the Galaxy type system, probably
>   via a galaxy-tardis wrapper
>
>
>
> We would be keen to contribute our approach to Galaxy if people are
> interested.
>

Very interesting approach, thanks for reaching out and open sourcing your stuff!

What form would this contribution take? Are there changes that need to
be made to the Galaxy core, or were you hoping to mark up the core
tools with tardis markup? I would be happy to help adjust (generalize?)
the framework to support your use cases if there are things to be done
in that arena, but I would be skeptical about marking up the core tools
with tardis descriptions and introducing a dependency between these
tools and tardis.

I would imagine most Galaxy developers would agree with me that the
Galaxy philosophy is that tools should be agnostic to the job
manager/DRM used; in fact, they should all ideally work with the local
job runner with no DRM available at all. Handling that interaction is
the responsibility of job runners. Galaxy is structured in layers too;
it is just that job manager interaction sits above the tool layer, not
below it. I am not certain this is the correct philosophy, but I
suspect it is the Galaxy philosophy, and it is how Galaxy main and the
devteam tools are likely to continue to operate.

Like I said though, I would be eager for the platform itself to
support alternative models of interaction between Galaxy, jobs,
tools, parallelism, etc. One of the really nice things about the
tardis approach is that it seems very self-contained and all at the
tool level, so you should be able to create tardis tools pretty
easily, use them inside a normal Galaxy instance, and distribute them
on the tool shed.

I hope this is reasonable and we can find ways to collaborate even if
we don't agree on every single point :).

-John

>
> Cheers
>
> Alan McCulloch
> Bioinformatics Software Engineer
> AgResearch NZ
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   http://lists.bx.psu.edu/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/
