[galaxy-dev] tardis job splitter

McCulloch, Alan Mon, 28 Oct 2013 19:40:11 -0700

dear all,

There have been a few posts lately about doing distributed computing via Galaxy 
- i.e.
job splitters etc - below a contribution of some ideas we have developed
and applied in our work, where we have arranged for some Galaxy tools to 
execute in parallel
on our cluster.


We have developed a job-splitter script "tardis.py" (available from
https://bitbucket.org/agr-bifo/tardis), which takes marked-up
standard unix commands that run an application or tool. The mark-up is
prefixed to the input and output command-line options. Tardis strips off the
mark-up, and re-writes the commands to refer to split inputs and outputs, which 
are then
executed in parallel e.g. on a distributed compute resource. Tardis knows
the output files to expect and how to join them back together.

(This was referred to in our GCC2013 talk
http://wiki.galaxyproject.org/Events/GCC2013/Abstracts#Events.2FGCC2013.2FAbstracts.2FTalks.A_layered_genotyping-by-sequencing_pipeline_using_Galaxy
 )

Any reasonable unix based data processing or analysis command may be marked up 
and run
using tardis, though of course tardis needs to know how to split and join the 
data. Our approach
also assumes  a "symmetrical" HPC cluster configuration, in the sense that each 
node sees the same
view of the file system (and has the required underlying application 
installed). We use tardis
to support both Galaxy and command-line based compute.

Background / design pattern / motivating analogy: Galaxy provides a high level
"end to end" view of a workflow; the HPC cluster resource that one uses then 
involves
spraying chunks of data out into parallel processes, usually in the form of 
some kind of
distributed compute cluster - but an end-user looking at a Galaxy history, 
should ideally  not be able
to tell whether the workflow was run as a single process on the server, or
via many parallel processes on the cluster (apart from the fact that when run
in parallel on the cluster, its alot faster!). We noticed that  the TCP / IP 
layered networking
protocol stack  provides a useful metaphor and design pattern - with the  "end 
to end" topology
of a Galaxy workflow  corresponding to the transport layer of TCP/ IP; and the 
distribution
of computation across a cluster corresponding  to the next TCP/IP layer down - 
the packet-routing
layer.

This picture suggested  a strongly layered approach to provisioning
Galaxy with parallelised compute on split data, and hence to an approach in 
which the
footprint in the Galaxy code-base, of parallel / distributed compute support, 
should ideally
(from the layered-design point of view) be minimal and superficial. Thus in our 
approach so far,
the only footprint is in the tool config files, where we arrange the templating 
to
(optionally) prefix the required  tardis mark-up  to the input and output 
command options, and
the tardis script name to the command as a whole.  tardis then takes care of 
rewriting and
launching all of the jobs, and finally joining the results back together and 
putting them where
galaxy expects them to be (and also housekeeping such as collating and passing 
up stderr and stdout , and
appropriate process exit codes). (For each galaxy job, tardis creates a working 
folder in a designated
scratch area, where input files are uncompressed and split; job files and their 
output
are stored; logging is done etc. Split data is cleaned up at the end unless 
there
was an error in some part of the job, in which case everything is retained
for debugging and in some cases restart)

(We modify Galaxy tool-configs so that the user can optionally choose to run
the tool on our HPC cluster - there are three HPC related input fields, appended
to the input section of a tool. Here the user selects whether they want to use
our cluster and if so, they specify the chunk size, and can also at that point
specify a sampling rate, since we often find it useful to be able to run 
preliminary
analyses on a random sample of (for example) single or paired-end NGS sequence
data, to obtain a fairly quick snapshot of the data, before the expense of a
complete run. We found it convenient to include support for input sampling
in tardis).

The pdf document at https://bitbucket.org/agr-bifo/tardis includes a number of
examples of marking up a command, and also a simple example of a galaxy 
tool-config that
has been modified to include support for optionally running the job on our HPC 
cluster
via the tardis pre-processor.

Known limitations:

* we have not yet attempted to integrate our approach with the existing Galaxy 
job-splitting
distributed compute support, partly because of our "layered" design goal 
(admittedly also partly
because of ignorance about its details ! )

* our current implementation is quite naive in the distributed compute API
it uses - it supports launching condor job files (and also native 
sub-processes) - our plan
is to replace that with using the drmaa API

* we would like to integrate it better with the galaxy type system, probably via
a galaxy-tardis wrapper

We would be keen to contribute our approach to Galaxy if people are
interested.

Cheers

Alan McCulloch
Bioinformatics Software Engineer
AgResearch NZ

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

[galaxy-dev] tardis job splitter

Reply via email to