hi Adam,

> Could you explain how your tool actually does the parallel processing on
> something that is sequential? For example in your PDF you mention the fastq
> example, but I do not see the explanation as to how it "splits" up the work
> across multiple cores/nodes. Does it simply split the sequence string N times
> and then merges the results?

yes,  if I understand you correctly.  A typical use-case for us is trimming and 
reference-aligning large numbers of 
short reads, usually paired. Tardis reads the (usually compressed) fastq input 
file (files if paired), and every "chunksize" sequence 
records, launches a  "reconditioned"  command as a job on the cluster. The 
reconditioned command is the
same as the original, except that it refers  to the chunk-input file(s) ,  and  
the chunk output file(s), rather than 
to the original input and output files.   (Paired fastq files are split in 
"semantic lock step" - seq names are 
required to match)

After all chunks are launched tardis polls for the expected chunk-outputs (and 
the  log, stdout and stderr files
associated with each completed chunk) . Once all results are in,  the "join" 
step is done - chunk outputs  are joined 
to yield the single  output file(s) expected by Galaxy (or the user if running 
from the command line). 
stderr and stdout from each chunk is collated and emitted as the stdout and 
stderr of tardis. 
Process exit codes from the chunk jobs are collated, and tardis exits with an 
appropriate 
consensus exit code (i.e. 0 if all were 0 , > 0 if not)

Because we are typically dealing with many very large fastq files, we want to 
avoid if possible 
too much uncompressed data, and redundant copies of data, as this can very 
quickly consume
our disk resources. A few things included that we have found useful in practice 
: 

* tardis reads and splits compressed fastq on the fly, so that the original can 
remain compressed in place
* fastq to fasta conversion is done on the fly and may be specified in the 
command mark-up - again 
   this means for example original compressed fastq can remain compressed 
in-place, and we do not 
   need to format-convert the whole file first. This is good  for use-cases 
such as inexpensively 
   blasting a random sample of short reads against a database of potential 
contaminants. 
* tardis can read a list file containing the names of (usually compressed fastq 
or fasta) files, and will treat  this 
   as one single input stream (each file is opened in turn and its contents 
processed as required). Again this is 
   useful in that a number of compressed fastq files can be  input to processes 
such as trimming, blast contamination 
   check  or reference alignment, while remaining compressed in place.
   (Such a list file then works quite fine in Galaxy, provided one is not above 
cheating by telling Galaxy that a list 
   file containing the names of compressed fastqsanger files, is actually 
itself a fastqsanger file. This is 
   also useful in avoiding an overly cluttered Galaxy history) 
* outputs are all compressed by default - e.g. mark-up of 
"_condition_fastq_output_myfile.fastq" 
   will result in a file "myfile.fastq.gz" being delivered. Uncompressed output 
is specified by using
   markup such as "_condition_uncompressedfastq_output_myfile.fastq". (Trying 
to encourage
   good housekeeping)
* support for random sampling of inputs - the idea again is that this can often 
help avoid disk and 
   processing contention, by allowing users to obtain inexpensive initial 
summaries and  insights
   into the data. In some cases (e.g. bad data), this avoids the cost and delay 
of a complete analysis. 

I hope that answers your question, and sorry if this post is a bit long.

Updating this to use drmaa API is about next on the to-do list (- I 'd be 
grateful
for any tips on python drmaa ? - e.g. best library or approach).

Also grateful for any general suggestions or comments. 

Cheers

Alan

 

> -----Original Message-----
> From: Adam Brenner [mailto:aebre...@uci.edu]
> Sent: Wednesday, 30 October 2013 5:37 a.m.
> To: McCulloch, Alan
> Cc: galaxy-dev@lists.bx.psu.edu; Harry Mangalam
> Subject: Re: [galaxy-dev] tardis job splitter
> 
> Alan,
> 
> At first glance this look promising. I am a little leery of tools that claim 
> to do
> parallel processing. However I would like to test it out on our HPC cluster
> here at UCI.
> 
> Few questions:
> 
> Could you explain how your tool actually does the parallel processing on
> something that is sequential? For example in your PDF you mention the fastq
> example, but I do not see the explanation as to how it "splits" up the work
> across multiple cores/nodes. Does it simply split the sequence string N times
> and then merges the results?
> 
> > * our current implementation is quite naive in the distributed compute
> > API it uses - it supports launching condor job files (and also native
> > sub-processes) - our plan is to replace that with using the drmaa API
> 
> We are strictly a SGE (Son of Grid Engine) cluster with a lot of work done by
> Joseph Farran (check pointing, freeq system, etc). Using DRMAA APIs would
> be great. If this tool can parallel fastq jobs along with BAM as described, it
> would be a great improvement for a number of people here.
> 
> ~Adam
> 
> --
> Adam Brenner
> Computer Science, Undergraduate Student
> Donald Bren School of Information and Computer Sciences
> 
> Research Computing Support
> Office of Information Technology
> http://www.oit.uci.edu/rcs/
> 
> University of California, Irvine
> www.ics.uci.edu/~aebrenne/
> aebre...@uci.edu
> 
> 
> On Mon, Oct 28, 2013 at 7:39 PM, McCulloch, Alan
> <alan.mccull...@agresearch.co.nz> wrote:
> > dear all,
> >
> >
> >
> > There have been a few posts lately about doing distributed computing
> > via Galaxy - i.e.
> >
> > job splitters etc - below a contribution of some ideas we have
> > developed
> >
> > and applied in our work, where we have arranged for some Galaxy tools
> > to execute in parallel
> >
> > on our cluster.
> >
> >
> >
> > We have developed a job-splitter script "tardis.py" (available from
> >
> > https://bitbucket.org/agr-bifo/tardis), which takes marked-up
> >
> > standard unix commands that run an application or tool. The mark-up is
> >
> > prefixed to the input and output command-line options. Tardis strips
> > off the
> >
> > mark-up, and re-writes the commands to refer to split inputs and
> > outputs, which are then
> >
> > executed in parallel e.g. on a distributed compute resource. Tardis
> > knows
> >
> > the output files to expect and how to join them back together.
> >
> >
> >
> > (This was referred to in our GCC2013 talk
> >
> >
> http://wiki.galaxyproject.org/Events/GCC2013/Abstracts#Events.2FGCC201
> > 3.2FAbstracts.2FTalks.A_layered_genotyping-by-sequencing_pipeline_usin
> > g_Galaxy
> > )
> >
> >
> >
> > Any reasonable unix based data processing or analysis command may be
> > marked up and run
> >
> > using tardis, though of course tardis needs to know how to split and
> > join the data. Our approach
> >
> > also assumes  a "symmetrical" HPC cluster configuration, in the sense
> > that each node sees the same
> >
> > view of the file system (and has the required underlying application
> > installed). We use tardis
> >
> > to support both Galaxy and command-line based compute.
> >
> >
> >
> > Background / design pattern / motivating analogy: Galaxy provides a
> > high level
> >
> > "end to end" view of a workflow; the HPC cluster resource that one
> > uses then involves
> >
> > spraying chunks of data out into parallel processes, usually in the
> > form of some kind of
> >
> > distributed compute cluster - but an end-user looking at a Galaxy
> > history, should ideally  not be able
> >
> > to tell whether the workflow was run as a single process on the
> > server, or
> >
> > via many parallel processes on the cluster (apart from the fact that
> > when run
> >
> > in parallel on the cluster, its alot faster!). We noticed that  the
> > TCP / IP layered networking
> >
> > protocol stack  provides a useful metaphor and design pattern - with
> > the "end to end" topology
> >
> > of a Galaxy workflow  corresponding to the transport layer of TCP/ IP;
> > and the distribution
> >
> > of computation across a cluster corresponding  to the next TCP/IP
> > layer down
> > - the packet-routing
> >
> > layer.
> >
> >
> >
> > This picture suggested  a strongly layered approach to provisioning
> >
> > Galaxy with parallelised compute on split data, and hence to an
> > approach in which the
> >
> > footprint in the Galaxy code-base, of parallel / distributed compute
> > support, should ideally
> >
> > (from the layered-design point of view) be minimal and superficial.
> > Thus in our approach so far,
> >
> > the only footprint is in the tool config files, where we arrange the
> > templating to
> >
> > (optionally) prefix the required  tardis mark-up  to the input and
> > output command options, and
> >
> > the tardis script name to the command as a whole.  tardis then takes
> > care of rewriting and
> >
> > launching all of the jobs, and finally joining the results back
> > together and putting them where
> >
> > galaxy expects them to be (and also housekeeping such as collating and
> > passing up stderr and stdout , and
> >
> > appropriate process exit codes). (For each galaxy job, tardis creates
> > a working folder in a designated
> >
> > scratch area, where input files are uncompressed and split; job files
> > and their output
> >
> > are stored; logging is done etc. Split data is cleaned up at the end
> > unless there
> >
> > was an error in some part of the job, in which case everything is
> > retained
> >
> > for debugging and in some cases restart)
> >
> >
> >
> > (We modify Galaxy tool-configs so that the user can optionally choose
> > to run
> >
> > the tool on our HPC cluster - there are three HPC related input
> > fields, appended
> >
> > to the input section of a tool. Here the user selects whether they
> > want to use
> >
> > our cluster and if so, they specify the chunk size, and can also at
> > that point
> >
> > specify a sampling rate, since we often find it useful to be able to
> > run preliminary
> >
> > analyses on a random sample of (for example) single or paired-end NGS
> > sequence
> >
> > data, to obtain a fairly quick snapshot of the data, before the
> > expense of a
> >
> > complete run. We found it convenient to include support for input
> > sampling
> >
> > in tardis).
> >
> >
> >
> > The pdf document at https://bitbucket.org/agr-bifo/tardis includes a
> > number of
> >
> > examples of marking up a command, and also a simple example of a
> > galaxy tool-config that
> >
> > has been modified to include support for optionally running the job on
> > our HPC cluster
> >
> > via the tardis pre-processor.
> >
> >
> >
> > Known limitations:
> >
> >
> >
> > * we have not yet attempted to integrate our approach with the
> > existing Galaxy job-splitting
> >
> > distributed compute support, partly because of our "layered" design
> > goal (admittedly also partly
> >
> > because of ignorance about its details ! )
> >
> >
> >
> > * our current implementation is quite naive in the distributed compute
> > API
> >
> > it uses - it supports launching condor job files (and also native
> > sub-processes) - our plan
> >
> > is to replace that with using the drmaa API
> >
> >
> >
> > * we would like to integrate it better with the galaxy type system,
> > probably via
> >
> > a galaxy-tardis wrapper
> >
> >
> >
> > We would be keen to contribute our approach to Galaxy if people are
> >
> > interested.
> >
> >
> >
> > Cheers
> >
> >
> >
> > Alan McCulloch
> >
> > Bioinformatics Software Engineer
> >
> > AgResearch NZ
> >
> >
> >
> >
> >
> >
> >
> __________________________________________________________
> _
> > Please keep all replies on the list by using "reply all"
> > in your mail client.  To manage your subscriptions to this and other
> > Galaxy lists, please use the interface at:
> >   http://lists.bx.psu.edu/
> >
> > To search Galaxy mailing lists use the unified search at:
> >   http://galaxyproject.org/search/mailinglists/

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

Reply via email to