Hey Evan,

  Galaxy should perhaps be able to retry submissions that fail,
especially if they fail quickly, and I have created a Trello card for
this here (https://trello.com/c/hxy2bcIb). Nate has added some
features for job state handling plugins
(https://bitbucket.org/galaxy/galaxy-central/commits/7b209e06ddb944e953d340754439f4e3e5dc339d)
and it may be possible to write a plugin to do this today, though
immediate submission failures should maybe be handled a level above
this by the framework... not sure.
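
  To make the retry idea a bit more concrete, here is a rough sketch
of what "retry a submission that fails quickly" could look like. This
is not Galaxy code and not the job state handling plugin interface;
`submit_with_retry` and all of its parameters are names I made up for
illustration, and `submit` stands in for whatever callable actually
performs the submission.

```python
import time


def submit_with_retry(submit, max_attempts=3, delay=1.0, sleep=time.sleep):
    """Call submit() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: `submit` is any zero-argument callable that
    returns a job id or raises on failure. `sleep` is injectable so the
    wait can be skipped in tests.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception as exc:
            # Remember the failure and pause before the next attempt.
            last_error = exc
            if attempt < max_attempts:
                sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

  A real version would probably want to retry only on specific error
types rather than on every exception.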

  I am not really sure this is the appropriate solution for this
particular problem though - this seems like an unfortunate
interplay between your file system and your cluster manager, and it
would seem that any script or platform that automates the creation and
submission of jobs would potentially be subject to the same problem.
Solving it in Galaxy would be an application-level solution to a
system-level configuration problem, in my opinion. Have you run this
problem by the systems staff? It seems like it should be possible to
delay each submission by half a second or change the flushing
settings of the file system.

  As you mentioned, a local workaround might be to add
`time.sleep(1)` before `external_job_id = self.ds.runJob(jt)` in
lib/galaxy/jobs/runners/drmaa.py, or before the similar line in
pbs.py. Do you want to try that and let us know if it addresses the
problem?
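
  In isolation, the workaround amounts to something like the sketch
below. The helper name `run_job_after_delay` and the constant
`SUBMIT_DELAY_SECONDS` are hypothetical; in drmaa.py you would just
put the sleep directly above the existing `runJob` line rather than
add a function. `session` and `job_template` stand in for `self.ds`
and `jt`.

```python
import time

# Hypothetical setting: how long to wait before each submission.
SUBMIT_DELAY_SECONDS = 1


def run_job_after_delay(session, job_template,
                        delay=SUBMIT_DELAY_SECONDS, sleep=time.sleep):
    """Pause briefly, then submit the job through the DRMAA session.

    The pause is meant to give the file system time to finish writing
    and closing the job script before the cluster tries to execute it.
    """
    sleep(delay)
    return session.runJob(job_template)
```

  A fixed one second delay is a guess; the right value, if any, depends
on how your file system flushes writes.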

  Finally, in terms of the workflow - if you rerun the failed step in
the GUI you should be given the option via a new checkbox on the tool
form to resume the workflow.

-John


On Mon, Jan 5, 2015 at 4:48 PM, Evan Bollig PhD <boll0...@umn.edu> wrote:
> I get this error occasionally:
>
> "/bin/sh: 1: 
> /opt/galaxy/web/database/job_working_directory/000/100/galaxy_100.sh:
> Text file busy"
>
> When this occurs, the step fails outright. Resubmitting the step
> resolves the issue and things run no problem. If this error appears
> early in a long workflow, I have to manually resubmit ALL dependent
> steps... what a pain!
>
> Perhaps this is something the Galaxy job scheduler can look out for,
> flush() the system, sleep() a second or two to let the file write and
> close, and then rerun. A more fault-tolerant way of running workflows
> without unnecessary human intervention.
>
> Cheers,
> -Evan Bollig
> Research Associate | Application Developer | User Support Consultant
> Minnesota Supercomputing Institute
> 599 Walter Library
> 612 624 1447
> e...@msi.umn.edu
> boll0...@umn.edu
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/