>>> I am not using job splitting, because I am implementing this for a client
>>> with a small (one machine) galaxy setup.
>> Ah - this also explains why a job size limit is important for you.
>>> Implementing a query limit feature in galaxy core would probably be the best
>>> idea, but that would also probably require an admin screen to edit those
>>> limits, and I don't think I can sell the required time to my boss under the
>>> contract we have with the client.
>> The wrapper script idea I outlined to you earlier would be the least
>> invasive (although might cause trouble if BLAST is run at the command
>> line outside Galaxy), while your idea of inserting the check script into
>> the Galaxy Tool XML just before running BLAST itself should also
>> work well.
> While looking at Jan's pull request to insert a query size limit before
> running BLAST https://github.com/peterjc/galaxy_blast/pull/43
> I realised that this will not work so well if job-splitting is enabled.
> If using the job-splitting parallelism setting in Galaxy, then the BLAST
> query FASTA file is broken up into chunks of 1000 sequences. This
> means the new check would be made at the chunk level - so it could
> in effect catch extremely long query sequences (e.g. chromosomes),
> but could not block anyone submitting one query FASTA file containing
> many thousands of moderate length query sequences (e.g. genes).
> John - that Trello issue you logged, https://trello.com/c/0XQXVhRz
> Generic infrastructure to let deployers specify limits for tools based
> on input metadata (number of sequences, file size, etc...)
> Would it be fair to say this is not likely to be implemented in the near
> future? i.e. Should we consider implementing the BLAST query limit
> approach as a short term hack?

It would be good functionality - but I don't foresee myself or anyone
on the core team getting to it in the next six months say.


I am now angry with myself though because I realized that dynamic job
destinations are a better way to implement this in the meantime (that
environment stuff was very fresh when I responded so I think I just
jumped there). You can build a flexible infrastructure locally that is
largely decoupled from the tools and that may (?) work around the task
splitting problem Peter brought up.
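
To make the splitting problem concrete, here is a minimal sketch. It is not
Galaxy code: the chunk size of 1000 comes from Peter's description, while the
limit of 2000 sequences and the function names are made up for illustration.

```python
# Illustration only, not Galaxy code. CHUNK_SIZE follows Peter's description;
# LIMIT and both function names are hypothetical.
CHUNK_SIZE = 1000
LIMIT = 2000


def split_into_chunks(n_sequences, chunk_size=CHUNK_SIZE):
    """Return the number of sequences in each chunk after splitting."""
    full, rest = divmod(n_sequences, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def passes_limit(n_sequences, limit=LIMIT):
    """True if the sequence count is within the limit."""
    return n_sequences <= limit


total = 5000
chunks = split_into_chunks(total)

# Checked once against the whole query file, the submission is rejected.
print(passes_limit(total))                    # False
# Checked per chunk, every chunk passes, so the job goes through.
print(all(passes_limit(c) for c in chunks))   # True
```

A check that only ever sees one chunk cannot enforce a limit larger than the
chunk size.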

Outline of the idea:

Create a Python script - say lib/galaxy/jobs/mapper_limits.py and add
some functions to it like:

# Helper utilities for limiting tool inputs.
from galaxy.jobs.mapper import JobMappingException

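DEFAULT_QUERY_LIMIT_MESSAGE = (
  "Size of input exceeds query limit of this Galaxy instance."
)

def assert_fewer_than_n_sequences(input_path, n, msg=DEFAULT_QUERY_LIMIT_MESSAGE):
  # Count FASTA records: each record starts with a ">" header line.
  with open(input_path) as handle:
    num_sequences = sum(1 for line in handle if line.startswith(">"))
  if num_sequences > n:
    raise JobMappingException(msg)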

# Do same for other checks...
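
As a rough sketch of what "other checks" could look like, here are two more
helpers in the same style: one on file size and one on total residue count
(which would catch the chromosome-sized queries Peter mentioned). The function
names and message are my own invention, not existing Galaxy API, and the
fallback exception class is only there so the sketch runs outside a Galaxy
checkout.

```python
import os

try:
    from galaxy.jobs.mapper import JobMappingException
except ImportError:
    # Stand-in so the sketch can be tried outside a Galaxy checkout.
    class JobMappingException(Exception):
        pass

# Hypothetical default message for these example checks.
DEFAULT_SIZE_LIMIT_MESSAGE = "Input is larger than this Galaxy instance allows."


def assert_smaller_than_n_bytes(input_path, n, msg=DEFAULT_SIZE_LIMIT_MESSAGE):
    """Reject the job if the file on disk is larger than n bytes."""
    if os.path.getsize(input_path) > n:
        raise JobMappingException(msg)


def assert_fewer_than_n_residues(input_path, n, msg=DEFAULT_SIZE_LIMIT_MESSAGE):
    """Reject the job if the FASTA file holds more than n residues in total."""
    num_residues = 0
    with open(input_path) as handle:
        for line in handle:
            # Header lines start with ">"; everything else is sequence data.
            if not line.startswith(">"):
                num_residues += len(line.strip())
    if num_residues > n:
        raise JobMappingException(msg)
```

Like the sequence-count check, both raise only when the value is strictly
greater than n.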

This is an abstract file that has nothing to do with the institution
or toolbox really. Once you get it working - open a pull request and
we can probably get this integrated into Galaxy (as long as it is
abstract enough). Then deployers can create specific rules for that
particular cluster and toolbox:

Create lib/galaxy/jobs/runners/rules/instance_dests.py:

from galaxy.jobs import mapper_limits

def limited_blast(job, app):
  inp_data = dict( [ ( da.name, da.dataset ) for da in job.input_datasets ] )
  query_file = inp_data[ "query" ].file_name
  mapper_limits.assert_fewer_than_n_sequences( query_file, 300 )
  return app.job_config.get_destination( "blast_base" )
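
If you want to try the rule before wiring it into Galaxy, something like the
following should do. This is only a sketch: the job and app objects are
throwaway stubs exposing just the attributes the rule touches, the exception
class and helper are inlined stand-ins for the real imports, and fake_job and
fake_app are names I made up.

```python
import tempfile
from types import SimpleNamespace


class JobMappingException(Exception):
    """Stand-in for galaxy.jobs.mapper.JobMappingException."""


def assert_fewer_than_n_sequences(input_path, n, msg="Query limit exceeded."):
    # Count FASTA records: each record starts with a ">" header line.
    with open(input_path) as handle:
        num_sequences = sum(1 for line in handle if line.startswith(">"))
    if num_sequences > n:
        raise JobMappingException(msg)


def limited_blast(job, app):
    inp_data = dict((da.name, da.dataset) for da in job.input_datasets)
    query_file = inp_data["query"].file_name
    assert_fewer_than_n_sequences(query_file, 300)
    return app.job_config.get_destination("blast_base")


def fake_job(n_sequences):
    """Stub job whose "query" input is a FASTA file with n_sequences records."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False)
    with handle:
        for i in range(n_sequences):
            handle.write(">seq%d\nACGT\n" % i)
    dataset = SimpleNamespace(file_name=handle.name)
    association = SimpleNamespace(name="query", dataset=dataset)
    return SimpleNamespace(input_datasets=[association])


# Stub app: get_destination just echoes the id instead of a destination object.
fake_app = SimpleNamespace(
    job_config=SimpleNamespace(get_destination=lambda dest_id: dest_id)
)

print(limited_blast(fake_job(10), fake_app))      # blast_base
try:
    limited_blast(fake_job(301), fake_app)
except JobMappingException as exc:
    print("rejected: %s" % exc)                   # rejected: Query limit exceeded.
```

The temporary FASTA files are left on disk, so clean them up if you run this
more than a few times.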


Then open job_conf.xml and add the correct destinations...

    <!-- under <destinations> -->
    <destination id="limited_blast" runner="dynamic">
      <param id="function">limited_blast</param>
    </destination>
    <destination id="blast_base" runner="torque" /> <!-- or whatever -->

    <!-- under <tools> -->
    <tool id="ncbi_blastn_wrapper" destination="limited_blast" />
    <tool id="ncbi_blastp_wrapper" destination="limited_blast" />

Jan I am really sorry I didn't come up with this before you did all
that work. Hopefully what you did for "limit_query_size.py" can be
reused in this context.


