On Fri, Jun 27, 2014 at 5:16 AM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <p.j.a.c...@googlemail.com> 
> wrote:
>> On Wed, Jun 18, 2014 at 12:04 PM, Jan Kanis <jan.c...@jankanis.nl> wrote:
>>> I am not using job splitting, because I am implementing this for a client
>>> with a small (one machine) galaxy setup.
>> Ah - this also explains why a job size limit is important for you.
>>> Implementing a query limit feature in Galaxy core would probably be the best
>>> idea, but that would also probably require an admin screen to edit those
>>> limits, and I don't think I can sell the required time to my boss under the
>>> contract we have with the client.
>> The wrapper script idea I outlined to you earlier would be the least
>> invasive (although might cause trouble if BLAST is run at the command
>> line outside Galaxy), while your idea of inserting the check script into
>> the Galaxy Tool XML just before running BLAST itself should also
>> work well.
> While looking at Jan's pull request to insert a query size limit before
> running BLAST https://github.com/peterjc/galaxy_blast/pull/43
> I realised that this will not work so well if job-splitting is enabled.
> If using the job-splitting parallelism setting in Galaxy, then the BLAST
> query FASTA file is broken up into chunks of 1000 sequences. This
> means the new check would be made at the chunk level - so it could
> in effect catch extremely long query sequences (e.g. chromosomes),
> but could not block anyone submitting one query FASTA file containing
> many thousands of moderate length query sequences (e.g. genes).
> John - that Trello issue you logged, https://trello.com/c/0XQXVhRz
> ("Generic infrastructure to let deployers specify limits for tools based
> on input metadata (number of sequences, file size, etc...)") - would it
> be fair to say this is not likely to be implemented in the near
> future? i.e. should we consider implementing the BLAST query limit
> approach as a short term hack?
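
To make the job-splitting concern above concrete, here is a small sketch with
made-up numbers (a hypothetical 5000-sequence query file, the 1000-sequence
chunk size mentioned earlier, and an intended 2000-sequence limit) showing why
a per-chunk check cannot enforce a whole-file limit:

```python
# Hypothetical numbers illustrating the chunk-level check problem.
TOTAL_SEQUENCES = 5000   # sequences in the submitted query FASTA (made up)
CHUNK_SIZE = 1000        # Galaxy's job-splitting chunk size
LIMIT = 2000             # intended per-submission query limit (made up)

# Split the total into chunks, as the parallelism setting would.
chunks = [CHUNK_SIZE] * (TOTAL_SEQUENCES // CHUNK_SIZE)

# Every individual chunk passes the check...
assert all(chunk <= LIMIT for chunk in chunks)
# ...even though the file as a whole far exceeds the limit.
assert sum(chunks) > LIMIT
```

A check applied before splitting (or at the mapping stage, as below) avoids this.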

It would be good functionality - but I don't foresee myself or anyone
on the core team getting to it in, say, the next six months.


I am now angry with myself though because I realized that dynamic job
destinations are a better way to implement this in the meantime (that
environment stuff was very fresh when I responded so I think I just
jumped there). You can build a flexible infrastructure locally that is
largely decoupled from the tools and that may (?) work around the task
splitting problem Peter brought up.

Outline of the idea:

Create a Python script - say lib/galaxy/jobs/mapper_limits.py and add
some functions to it like:

# Helper utilities for limiting tool inputs.
from galaxy.jobs.mapper import JobMappingException

DEFAULT_QUERY_LIMIT_MESSAGE = ("Size of input exceeds query limit of "
                               "this Galaxy instance.")

def assert_fewer_than_n_sequences(input_path, n, msg=DEFAULT_QUERY_LIMIT_MESSAGE):
    # Count FASTA records by counting ">" header lines (one simple approach).
    with open(input_path) as handle:
        num_sequences = sum(1 for line in handle if line.startswith(">"))
    if num_sequences > n:
        raise JobMappingException(msg)

# Do same for other checks...

This is an abstract file that has nothing to do with the institution
or toolbox really. Once you get it working - open a pull request and
we can probably get this integrated into Galaxy (as long as it is
abstract enough). Then deployers can create specific rules for that
particular cluster and toolbox:

Create lib/galaxy/jobs/runners/rules/instance_dests.py

from galaxy.jobs import mapper_limits

def limited_blast(job, app):
  inp_data = dict( [ ( da.name, da.dataset ) for da in job.input_datasets ] )
  query_file = inp_data[ "query" ].file_name
  mapper_limits.assert_fewer_than_n_sequences( query_file, 300 )
  return app.job_config.get_destination( "blast_base" )


Then open job_conf.xml and add the correct destinations...

    <destination id="limited_blast" runner="dynamic">
      <param id="function">limited_blast</param>
    </destination>
    <destination id="blast_base" runner="torque"/> <!-- or whatever -->
    <tool id="ncbi_blastn_wrapper" destination="limited_blast" />
    <tool id="ncbi_blastp_wrapper" destination="limited_blast" />

Jan, I am really sorry I didn't come up with this before you did all
that work. Hopefully what you did for "limit_query_size.py" can be
reused in this context.
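
For anyone who wants to try the counting logic outside a Galaxy checkout, here
is a self-contained sketch; ValueError stands in for JobMappingException (which
needs a Galaxy install), and counting ">" header lines is just one simple way
to count FASTA records:

```python
# Standalone sketch of the sequence-count check; ValueError stands in
# for galaxy.jobs.mapper.JobMappingException so it runs without Galaxy.
import os
import tempfile

DEFAULT_QUERY_LIMIT_MESSAGE = ("Size of input exceeds query limit of "
                               "this Galaxy instance.")

def assert_fewer_than_n_sequences(input_path, n, msg=DEFAULT_QUERY_LIMIT_MESSAGE):
    # Count FASTA records by counting ">" header lines.
    with open(input_path) as handle:
        num_sequences = sum(1 for line in handle if line.startswith(">"))
    if num_sequences > n:
        raise ValueError(msg)

# Quick check with a throwaway two-record FASTA file:
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nACGT\n>seq2\nGGCC\n")
    path = tmp.name

assert_fewer_than_n_sequences(path, 2)   # two records, limit two: passes
try:
    assert_fewer_than_n_sequences(path, 1)
    raise AssertionError("limit of 1 should have been exceeded")
except ValueError:
    pass
os.remove(path)
```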


> Thanks,
> Peter