With slurm, the following change that I backed out should fix the problem:
That said, I believe that if Galaxy doesn't read the completion state before
slurm "forgets" about the job (MinJobAge in slurm.conf), this change could
leave the job permanently stuck in the running state.
I should have some enhancements to the DRMAA runner for slurm coming soon that
would prevent this.
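To illustrate the two failure modes above, here is a rough sketch of the idea,
not the actual DRMAA runner code. Every name in it (poll_job_state,
ControllerUnreachable, JobUnknown, query_state) is hypothetical and stands in
for whatever the runner really calls. It assumes a controller connection
failure can be told apart from slurm no longer knowing the job:

```python
import time


class ControllerUnreachable(Exception):
    """Hypothetical: the scheduler controller could not be contacted."""


class JobUnknown(Exception):
    """Hypothetical: the scheduler no longer has a record of the job
    (for slurm, e.g. after MinJobAge has expired)."""


def poll_job_state(query_state, job_id, max_failures=5, delay=0.0,
                   sleep=time.sleep):
    """Return the job state, tolerating transient controller failures.

    query_state is a callable(job_id) -> str; it is a stand-in for the
    real status call and is an assumption of this sketch.
    """
    failures = 0
    while True:
        try:
            return query_state(job_id)
        except ControllerUnreachable:
            # A connect failure says nothing about the job itself, so
            # retry a bounded number of times instead of failing the job.
            failures += 1
            if failures >= max_failures:
                return "unreachable"
            sleep(delay)
        except JobUnknown:
            # The scheduler forgot the job: report that explicitly so the
            # caller does not leave it in the running state forever.
            return "lost"
```

A caller would keep the job in the running state on "unreachable" and check
again on the next monitoring cycle, and on "lost" would fall back to some
other source for the final state, such as the job's output files or the
accounting records, rather than polling forever.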
On Oct 31, 2013, at 5:27 AM, Nikolai Vazov wrote:
> I discovered a weird issue in the job behaviour: Galaxy is running a long
> job on a cluster (more than 24h); about 15 hours later it loses the
> connection with SLURM on the cluster and logs the following message:
> [root@galaxy-prod01 galaxy-dist]# grep 3715200 paster.log
> galaxy.jobs.runners.drmaa INFO 2013-10-30 10:51:54,149 (555) queued as 3715200
> galaxy.jobs.runners.drmaa DEBUG 2013-10-30 10:51:55,149 (555/3715200) state change: job is queued and active
> galaxy.jobs.runners.drmaa DEBUG 2013-10-30 10:52:13,516 (555/3715200) state change: job is running
> galaxy.jobs.runners.drmaa INFO 2013-10-31 03:29:33,090 (555/3715200) job left DRM queue with following message: code 1: slurm_load_jobs error: Unable to contact slurm controller (connect failure),job_id: 3715200
> Is there a timeout in Galaxy for contacting slurm? Yet, the job is still
> running properly on the cluster ...
> Thanks for help, it's really urgent :)
> Nikolay Vazov, PhD
> Research Computing Centre - http://hpc.uio.no
> USIT, University of Oslo
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
> To search Galaxy mailing lists use the unified search at: