Hi,

We have a 256-node cluster running RHEL 5.2, with PBS version 2.5.8 and Maui version 3.2.6p21. Sometimes a job submitted by a user goes into the Deferred state instead of starting or waiting in the queue. Running checkjob on such a job shows the error message below. After we run releasehold <job id>, the job leaves the Deferred state and either starts or goes into the queue. The message says the connection to the MOM timed out, but the node is very much online.
Error:

##################################################
checking job 8210

State: Idle  EState: Deferred
Creds:  user:john  group:chem  account:dadopr  class:chemo  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Thu Nov 1 15:15:13
  (Time Queued  Total: 00:29:00  Eligible: 00:00:02)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: par1
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]

IWD: [NONE]  Executable: [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE

job is deferred.  Reason: RMFailure  (cannot start job - RM failure, rc: 15043, msg: 'Execution server rejected request MSG=connection to mom timed out')
Holds: Defer  (hold reason: RMFailure)
PE: 1.00  StartPriority: 1
cannot select job 8210 for partition par1 (job hold active)
cannot select job 8210 for partition par2 (job hold active)
#########################################################################
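In case it helps anyone who wants to track how often this happens, here is a minimal Python sketch that pulls the job id, states, return code and message out of checkjob text like the output above. The function name parse_checkjob and the dictionary keys are my own invention, not part of Maui or PBS, and the regular expressions are assumed from this one sample only, so other checkjob layouts may need adjusting.

```python
import re


def parse_checkjob(text):
    """Extract job id, states, and deferral details from checkjob output.

    Hypothetical helper: the patterns are based only on the sample output
    in this post and work on both single-line and multi-line text.
    """
    info = {}

    m = re.search(r"checking job (\S+)", text)
    if m:
        info["job"] = m.group(1)

    # \b keeps this from matching the "State" inside "EState".
    m = re.search(r"\bState:\s*(\S+)\s+EState:\s*(\S+)", text)
    if m:
        info["state"], info["estate"] = m.groups()

    # Resource manager return code and quoted message.
    m = re.search(r"rc:\s*(\d+),\s*msg:\s*'([^']*)'", text)
    if m:
        info["rc"] = int(m.group(1))
        info["msg"] = m.group(2)

    m = re.search(r"hold reason:\s*(\w+)", text)
    if m:
        info["hold_reason"] = m.group(1)

    return info


# Shortened copy of the output from this post.
sample = (
    "checking job 8210 State: Idle EState: Deferred "
    "job is deferred. Reason: RMFailure (cannot start job - RM failure, "
    "rc: 15043, msg: 'Execution server rejected request "
    "MSG=connection to mom timed out') Holds: Defer (hold reason: RMFailure)"
)

info = parse_checkjob(sample)
print(info)
```

Feeding it the saved output of checkjob for each deferred job would let the failures be counted per message or matched against node names, which might show whether the timeouts cluster on particular nodes.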
