An update:
I notice that when these jobs are stuck, one way to get them started is to use qalter to set a walltime that is less than the default walltime. We have a default_walltime of 9999:00:00 set at the server level and require users to specify the CPU time they need.
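To make the workaround concrete, this is roughly what we do (the exact qmgr attribute name is from memory, so please check it against your own server config; 48:00:00 is just an example value):

  # server-wide default, set once via qmgr (I believe the attribute is resources_default.walltime)
  qmgr -c "set server resources_default.walltime = 9999:00:00"

  # workaround for a stuck job: give it a walltime below that default
  qalter -l walltime=48:00:00 <jobid>

After the qalter, the stuck job gets started.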

This default was set a long time ago and had not caused any issues until now. But it now seems that, with this default in place, if a user submits a job with an explicit -l walltime=<time> specification, that job runs while older jobs carrying the default walltime keep waiting. A sketch of the two cases follows below.
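For example (illustrative values only, job.sh is just a placeholder script):

  # submitted with an explicit walltime -- this job runs
  qsub -l walltime=24:00:00 job.sh

  # submitted without a walltime, so it inherits the 9999:00:00 default -- this job waits in Q
  qsub job.sh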

Can someone please shed some light on this? I am out of clues here.

Thanks.

-Neel

Neelesh Arora wrote:
Hi All,

I am using torque-2.0.0p2 and maui-3.2.6p13, and noticed the following behavior today:

- There are several jobs in the queue that are stuck in the Q state. When I run checkjob <jobid>, I get (among other things):
"job can run in partition DEFAULT (63 procs available.  1 procs required)"
but the job remains in Q forever. So this is not a case of a resource requirement not being met (as the message above indicates).

- nothing untoward in the torque logs

- I see several of these messages in maui.log:
MSysRegEvent(JOBCORRUPTION: job 'jobid' has the following idle node(s) allocated: 'node114' ,0,0,1)
but these are for the running jobs, not the Q'ed jobs in question

- I also see messages like these in the maui.log:
INFO:     PBS node node114 set to state Idle (free)
INFO:     node 'node114' changed states from Running to Idle
even though this node has 2 of its 4 procs busy. This message is repeated for several nodes.

- restarting torque and maui did not help either

- if I say qrun <jobid> for the stuck jobs, I get:
qrun: Resource temporarily unavailable <jobid>

- but if I do runjob <jobid>, the jobs are started!

I am unable to correlate all of this information. Does anyone know what might be going wrong, or where else I can hunt for clues?

Thanks.

-Neel
