Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

Peter Cock Fri, 17 Feb 2012 04:30:17 -0800

On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
> On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
>> Cool!  Seems like a perfectly fine start.  I guess you could
>> grab the # of sequences from the dataset somehow (I'm
>> guessing that is set somehow upon import into Galaxy).
>
> Yes, I should be able to get that from Galaxy's metadata
> if known - much like how the FASTQ splitter works. It only
> needs to be an estimate anyway - which is what I think
> Galaxy does for large files - if we get it wrong then rather
> than using n sub-jobs as suggested, we might use n+1
> or n-1.


Done, and it seems to be working nicely now. If we don't
know the sequence count, I divide the file based on its
total size in bytes - which avoids the extra IO of
counting the sequences first.
https://bitbucket.org/peterjc/galaxy-central/changeset/26a0c0aa776d
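
To illustrate the idea of splitting by size in bytes, here is a
minimal sketch. It is not the code from the changeset above: the
function name split_fasta_by_size, the output naming scheme, and
the assumption of a FASTA input are all mine, for illustration
only. It aims for about total_bytes / n bytes per part and only
starts a new part at a '>' header line, so no record is cut in
half and no separate counting pass is needed.

```python
import os


def split_fasta_by_size(path, n_parts, out_prefix):
    """Split a FASTA file into at most n_parts files of similar byte size.

    Hypothetical helper for illustration, not Galaxy's implementation.
    A new part only starts at a '>' header line. Returns the list of
    output filenames actually written.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    total = os.path.getsize(path)
    # Ceiling division: the target number of bytes per part.
    chunk = max(1, -(-total // n_parts))
    out_names = []
    out = None
    written = 0
    with open(path, "rb") as handle:
        for line in handle:
            # Start a new part at a record boundary once the current
            # part has reached the target size.
            if line.startswith(b">") and (out is None or written >= chunk):
                if out is not None:
                    out.close()
                name = "%s.%03d.fasta" % (out_prefix, len(out_names))
                out_names.append(name)
                out = open(name, "wb")
                written = 0
            if out is None:
                # Ignore anything before the first header line.
                continue
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return out_names
```

Because parts are only closed at record boundaries, a file with a
few very long records may give fewer than n_parts files, which
seems acceptable given the split is only meant to be approximate.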

Taking advantage of this, I have switched the BLAST tools
from splitting the query into batches of 500 sequences
(which worked fine, but only helped with genome-scale
queries) to simply splitting the query into four parts
(based on the sequence count if known, or the file size
if not). This way any multi-query BLAST gets divided and
run in parallel, not just the larger jobs. This gives a
nice improvement (over yesterday's progress) for small
tasks, such as 10 query sequences against a big database
like NR or NT.
https://bitbucket.org/peterjc/galaxy-central/changeset/1fb89ae798be
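
The arithmetic behind this can be sketched as follows. The two
function names are hypothetical and this is not the code in the
changeset; it just shows why fixed batches of 500 give a single
sub-job for 10 queries, while asking for four parts gives four,
and how a wrong sequence count estimate leads to n+1 or n-1
sub-jobs.

```python
def batch_size_for_parts(seq_count, n_parts):
    """Records per batch so that seq_count records give about n_parts batches.

    Hypothetical helper; seq_count may be an estimate.
    """
    # Integer ceiling division, with a minimum batch size of one record.
    return max(1, (seq_count + n_parts - 1) // n_parts)


def batch_sizes(actual_count, size):
    """Sizes of the batches produced when actual_count records are
    written out in batches of at most `size` records each."""
    full, rest = divmod(actual_count, size)
    return [size] * full + ([rest] if rest else [])


# Old behaviour: fixed batches of 500, so 10 queries stay in one job.
print(batch_sizes(10, 500))  # [10]

# New behaviour: ask for four parts, so the batch size becomes 3.
size = batch_size_for_parts(10, 4)
print(batch_sizes(10, size))  # [3, 3, 3, 1]

# If the count of 10 was only an estimate, the number of parts drifts.
print(batch_sizes(13, size))  # [3, 3, 3, 3, 1], i.e. n+1 parts
print(batch_sizes(8, size))   # [3, 3, 2], i.e. n-1 parts
```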

Peter

