Re: [galaxy-dev] Parallelism (job splitting) for ncbi_blast_plus running through CloudMan

David Kovalic Tue, 03 May 2016 15:28:23 -0700

Peter,

Thanks for the great information. I see where to tune the "split_size"
variable and also the READMEs :)



I'll do some more sleuthing, let the job run and observe. So far it is ~3hr
after job launch and there is still no load on the workers. From looking at
/mnt/galaxy/tmp/job_working_directory, I think all of the sub-job directories
have been prepared, so maybe this is an issue with the job scheduler/dispatch.
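
One rough way to check whether every chunk has been prepared is to compare the number of query sequences against the number of sub-job directories. The Python below is only a sketch: the function names are hypothetical, it assumes one sub-directory per sub-job directly under the job's working directory, and it assumes the last, partial chunk gets a directory of its own (hence the rounding up).

```python
import math
import os


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def expected_chunks(n_records, split_size=1000):
    """Sub-jobs expected when splitting into chunks of at most split_size.

    Assumption: a final partial chunk still becomes its own sub-job.
    """
    return math.ceil(n_records / split_size)


def count_subdirs(path):
    """Count the immediate sub-directories of path (assumed one per sub-job)."""
    return sum(1 for entry in os.scandir(path) if entry.is_dir())


def all_chunks_prepared(fasta_path, job_dir, split_size=1000):
    """True if job_dir holds at least as many sub-directories as expected."""
    expected = expected_chunks(count_fasta_records(fasta_path), split_size)
    return count_subdirs(job_dir) >= expected
```

For example, all_chunks_prepared("query.fasta", job_dir) could be called with job_dir pointing at the relevant directory under /mnt/galaxy/tmp/job_working_directory; both arguments here are placeholders, not paths taken from our setup.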

Thanks again.

David


On Tue, May 3, 2016 at 5:11 PM Peter Cock <[email protected]> wrote:

> On Tue, May 3, 2016 at 9:23 PM, David Kovalic <[email protected]> wrote:
> > Peter,
> >
> > We made the modification to the config file, restarted galaxy and things
> > seem to be working from the galaxy end. We see sub-job directories being
> > created in /mnt/galaxy/tmp/job_working_directory. We think all of the
> > required job chunks have been created (i.e. there are now total
> > sequences/1000 sub-job directories, and no more are being created).
> >
> > Now we have what may be a CloudMan question: our working cluster has a
> > head node and 4 workers. The head node is loaded up but the workers are
> > idle. I would have thought jobs should be pushing out to the workers but
> > we don't see any load on these machines.
> >
> > Any advice? Thanks.
>
> Wait a bit longer? The downside of the job splitting is the extra
> disk I/O overhead of splitting the files (here FASTA inputs) and
> then merging the output (e.g. BLAST tabular, XML, etc). IIRC,
> this happens on the head node only.
>
> I've not used CloudMan so I have no specific advice here,
> other than to ask: did you confirm that jobs were getting sent
> to the worker nodes before turning on use_tasked_jobs = True
> in your config/galaxy.ini file?
>
> > David
> >
> > PS. what is the path of the file which contains the split_size="1000"
> > configuration?
>
> This is currently defined by the tool wrapper author in the tool
> wrapper XML file. Setting up something which will work well
> on a broad range of input file sizes is a bit of an art - simply
> always dividing the input into 8 chunks does not scale well.
> With BLAST+ I found chunks of 1000 queries were a good
> balance, while for other tools processing FASTA inputs I
> used chunks of 2000 queries.
>
> I'll link to the latest files on GitHub, but you can browse this on
> the Galaxy Tool Shed too - it also ought to show the README
> text quite prominently:
>
> https://github.com/peterjc/galaxy_blast/tree/master/tools/ncbi_blast_plus
>
> In a simple tool, you would see the <parallelism> tag directly
> in the wrapper XML file, usually near the top by convention.
> e.g.
>
>
> https://github.com/peterjc/pico_galaxy/blob/master/tools/protein_analysis/promoter2.xml
>
> https://github.com/peterjc/pico_galaxy/blob/master/tools/protein_analysis/tmhmm2.xml
>
> However, with the BLAST+ wrappers we use macros. So,
> using BLASTX as an example, the wrapper is this XML file:
>
>
> https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_blastx_wrapper.xml
>
> No sign of the <parallelism> tag directly, but it is pulled in from:
>
>
> https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_macros.xml
>
> This happens via:
>
> ...
> <macros>
>     ...
>     <import>ncbi_macros.xml</import>
> </macros>
> <expand macro="parallelism" />
> ...
>
> This is a bit more complex, but it avoids repeating the XML
> snippet in almost all the BLAST+ wrapper files.
>
> Peter
>
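
To make the macro indirection described above concrete, here is a small Python sketch that looks for split_size first as a direct <parallelism> tag and then via <expand macro="..."/> in an imported macros file. It is not Galaxy's own macro expansion code. The function name find_split_size is hypothetical, and the two XML strings are illustrative stand-ins whose attribute values are assumptions (only split_size="1000" comes from this thread), not copies of the files on GitHub.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a macros file; attribute values are assumptions.
MACROS_XML = """
<macros>
    <xml name="parallelism">
        <parallelism method="multi" split_inputs="query"
                     split_mode="to_size" split_size="1000"
                     merge_outputs="output1" />
    </xml>
</macros>
"""

# Illustrative stand-in for a wrapper that pulls the tag in via a macro.
WRAPPER_XML = """
<tool id="example_wrapper" name="Example">
    <macros>
        <import>ncbi_macros.xml</import>
    </macros>
    <expand macro="parallelism" />
</tool>
"""


def find_split_size(wrapper_xml, macro_files):
    """Return split_size as a string, or None if no <parallelism> is found.

    macro_files maps an imported file name to its XML text.
    """
    tool = ET.fromstring(wrapper_xml)

    # Simple tools: the tag sits directly in the wrapper.
    direct = tool.find("parallelism")
    if direct is not None:
        return direct.get("split_size")

    # Macro-based tools: collect named macros from each imported file.
    macros = {}
    for imp in tool.findall("macros/import"):
        root = ET.fromstring(macro_files[imp.text.strip()])
        for macro in root.findall("xml"):
            macros[macro.get("name")] = macro

    # Follow each <expand> to its macro and look for the tag there.
    for expand in tool.iter("expand"):
        macro = macros.get(expand.get("macro"))
        if macro is not None:
            tag = macro.find("parallelism")
            if tag is not None:
                return tag.get("split_size")
    return None
```

Calling find_split_size(WRAPPER_XML, {"ncbi_macros.xml": MACROS_XML}) follows the import and the expand to reach the value defined in the macros file.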

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
