Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

Peter Cock Thu, 16 Feb 2012 13:02:22 -0800

On Thu, Feb 16, 2012 at 6:42 PM, Fields, Christopher J
<[email protected]> wrote:
> On Feb 16, 2012, at 12:24 PM, Peter Cock wrote:
>> I've checked in my FASTA splitting, which now seems to be
>> working OK with my BLAST tests.


(If this was unclear, I mean checked into my branch - I don't
have commit privileges to the main repository. When/if this
is ready I'll ask for it to be merged in though.)

>> So far this only does splitting
>> into chunks of the requested number of sequences, rather than
>> the option to split the whole file into a given number of pieces.
>> https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
>
> Cool!  Seems like a perfectly fine start.  I guess you could
> grab the # of sequences from the dataset somehow (I'm
> guessing that is set somehow upon import into Galaxy).

Yes, I should be able to get that from Galaxy's metadata
if known - much like how the FASTQ splitter works. It only
needs to be an estimate anyway (which I think is what
Galaxy does for large files). If we get it wrong then rather
than using n sub-jobs as suggested, we might use n+1
or n-1.
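
As a rough sketch of the idea (not the code in the changeset
above), splitting by a requested number of sequences and
deriving that number from an estimated record count could
look like the following. The names split_fasta and
chunk_size_for are made up for illustration, and the input is
assumed to be an iterable of FASTA text lines.

```python
import math


def split_fasta(lines, chunk_size):
    """Yield lists of lines, each holding at most chunk_size FASTA records.

    Hypothetical helper for illustration; a record starts at a '>' line.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk, count = [], 0
    for line in lines:
        if line.startswith(">"):
            # A new record begins: flush the current chunk if it is full.
            if count == chunk_size:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


def chunk_size_for(estimated_records, n_pieces):
    """Records per chunk so that roughly n_pieces chunks are produced.

    estimated_records may be an approximate count, for example taken
    from dataset metadata (an assumption, not a specific Galaxy API).
    """
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    return max(1, math.ceil(estimated_records / n_pieces))
```

For example, an estimate of 10 sequences and n = 3 gives a
chunk size of 4. If the file really holds 13 sequences that
yields 4 sub-jobs (n+1), and if it holds 8 it yields 2 (n-1).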

>> I also need to look at merging multiple BLAST XML outputs,
>> but this is looking promising.
>
> Yep, that's definitely one where a simple concatenation
> wouldn't work (though NCBI used to think so, years ago…)

Well, given the NCBI's historic practice of producing 'XML'
output which was the concatenation of several XML files,
some tools will tolerate this out of practicality - the Biopython
BLAST XML parser for example.

But yes, some care is needed over the header/footer to
ensure a valid XML output is created by the merge. This
may also require renumbering queries... I will check.
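
A minimal sketch of such a merge, assuming the usual NCBI
BLAST XML layout (a BlastOutput root holding one
BlastOutput_iterations element with Iteration children): keep
the header from the first file, append the iterations from
the rest, and renumber Iteration_iter-num. The function name
merge_blast_xml is hypothetical. This version leaves the
query IDs untouched, and ElementTree drops the DOCTYPE line
on output, so both would need separate handling.

```python
import xml.etree.ElementTree as ET


def merge_blast_xml(xml_texts):
    """Merge BLAST XML documents (as strings) into one XML string.

    Hypothetical helper for illustration. The first document supplies
    the header fields; iterations from later documents are appended.
    """
    merged_root = None
    merged_iterations = None
    next_num = 1
    for text in xml_texts:
        root = ET.fromstring(text)
        iterations = root.find("BlastOutput_iterations")
        if iterations is None:
            raise ValueError("missing BlastOutput_iterations element")
        if merged_root is None:
            # The first file provides the single header for the output.
            merged_root, merged_iterations = root, iterations
        for iteration in list(iterations):
            # Renumber so iteration numbers run 1, 2, 3, ... overall.
            num = iteration.find("Iteration_iter-num")
            if num is not None:
                num.text = str(next_num)
            next_num += 1
            if iterations is not merged_iterations:
                merged_iterations.append(iteration)
    if merged_root is None:
        raise ValueError("nothing to merge")
    return ET.tostring(merged_root, encoding="unicode")
```

Because only one root element and one header survive, the
result stays a single well-formed XML document rather than a
concatenation of several.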

Peter

