Hi,
With torque 2.1.2 + maui-3.2.6p16, jobs that have dependencies get a
system hold, which can be confusing for the administrator.
Please consider the following output from checkjob and qstat.
#> checkjob 350236
checking job 350236
State: Hold
Creds: user:USER group:GROUP class:default qos:low
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Aug 17 11:33:07
(Time Queued Total: 00:16:36 Eligible: 00:00:00)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE PREEMPTEE
Attr: PREEMPTEE
PE: 1.00 StartPriority: 1
cannot select job 350236 for partition DEFAULT (non-idle state 'Hold')
#> qstat -f 350236
Job Id: 350236.SERVER
Job_Name = e00018
job_state = H
queue = default
server = SERVER
Checkpoint = u
ctime = Thu Aug 17 11:33:07 2006
depend = afterany:[EMAIL PROTECTED]
[...]
Hold_Types = s
[...]
When I look at checkjob, I see that the job is in Hold state and assume
that something is wrong with it.
Then I look at Hold_Types in qstat, see "s" (SYSTEM HOLD), and conclude that
something has gone wrong, at least if I overlook the "depend =" line.
Now some questions:
1) Do these jobs go through the usual DEFER routine, with retries and
DEFERTIME checking? Or does maui magically know that this is NOT a deferred
job? *I* would think it is one.
2) I think a USER hold would be much more to the point, or a new hold type
such as DEPENDENCY-HOLD.
3) Could this somehow be made clearer to the administrator? It would be
great if checkjob just said
"cannot select job 350236 for partition DEFAULT (non-idle state 'Hold') -
5 of 10 job-dependencies not fulfilled" or something similar.
That would save me (and others?) from wondering, and from having to run
both qstat AND checkjob manually.
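Until checkjob reports something like this, the qstat side of the check could be scripted. The following is only a rough sketch in Python, assuming that `qstat -f` prints job attributes as `name = value` lines as in the output above. The function names `parse_qstat_full` and `explain_hold` are hypothetical and not part of torque or maui, and wrapped continuation lines in the qstat output are simply ignored.

```python
import re

# Sample text copied from the qstat -f output shown earlier in this post.
SAMPLE = """Job Id: 350236.SERVER
Job_Name = e00018
job_state = H
queue = default
depend = afterany:[EMAIL PROTECTED]
[...]
Hold_Types = s
"""


def parse_qstat_full(text):
    """Collect 'name = value' attribute lines into a dict.

    Assumption: one attribute per line; lines without '=' (such as the
    'Job Id:' header or '[...]') are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Za-z_][\w.]*)\s*=\s*(.*)$", line)
        if match:
            attrs[match.group(1)] = match.group(2).strip()
    return attrs


def explain_hold(attrs):
    """Return a one-line, human-readable guess at why the job is held."""
    if attrs.get("job_state") != "H":
        return "job is not held"
    hold_types = attrs.get("Hold_Types", "")
    depend = attrs.get("depend")
    if "s" in hold_types and depend:
        # A system hold together with a depend attribute is most likely
        # just an unmet dependency, not a real failure.
        return "system hold, probably caused by unmet dependency: " + depend
    return "held (Hold_Types = %s), no dependency listed" % hold_types


if __name__ == "__main__":
    print(explain_hold(parse_qstat_full(SAMPLE)))
```

In practice the sample text would be replaced by the real output of `qstat -f <jobid>`. The script only guesses from the attributes it sees; it cannot tell how many dependencies are still unmet.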
Cheers,
Ronny