Hi guys,
I think I've found a bug in Maui. Is this the right place to post?
Maui does not wait for the full extended violation time if the job has
been idle in the queue for a while. If such a job violates a resource
limit as soon as it starts, it is cancelled immediately instead of after
the configured violation time. This does not happen if the job has only
been in the queue briefly.
The problem seems to be that the MaxViolationTime check doesn't take into
account the time the job spent in the queue.
To check this, I added a line to Maui that logs J->RULVTime and
P->ResourceLimitMaxViolationTime[VRes]:
04/12 19:57:01 MSysRegEvent(For job '4660' , is J->RULVTime (45426) <
P->ResourceLimitMaxViolationTime[VRes] (300) ? ,0,0,1)
J->RULVTime was already 45426, far above the limit of 300, even though
the job had only just started.
Suggested fix: reset J->RULVTime when the job starts?
I'm using maui-3.2.6p19.
--------------------
section of code:
MLimit.c, line 296
    case mrlpExtendedViolation:

      /* determine length of violation */

      if (J->RULVTime < P->ResourceLimitMaxViolationTime[VRes])
        {
        /* ignore violation */

        ResourceLimitsExceeded = FALSE;
        }

      break;
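
To make the suggested fix concrete, here is a minimal sketch in C. It is not Maui source: the types and the sketch_* function names are hypothetical stand-ins that model only the two fields quoted above, and it assumes RULVTime should count violation seconds accumulated since the job started.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for Maui's job and partition
 * records; only the two fields quoted in this post are modelled. */
typedef struct {
    long RULVTime;                       /* accumulated violation time, seconds */
} sketch_job_t;

typedef struct {
    long ResourceLimitMaxViolationTime;  /* allowed violation time, seconds */
} sketch_par_t;

/* Proposed fix (an assumption, not existing Maui behaviour): clear the
 * counter when the job starts so nothing from before the start carries over. */
void sketch_job_start(sketch_job_t *J)
{
    J->RULVTime = 0;
}

/* Add the seconds the job spent in violation since the previous check. */
void sketch_record_violation(sketch_job_t *J, long seconds)
{
    assert(seconds >= 0);
    J->RULVTime += seconds;
}

/* Mirrors the mrlpExtendedViolation test: a violation is ignored while
 * RULVTime is still below the configured maximum. */
bool sketch_limits_exceeded(const sketch_job_t *J, const sketch_par_t *P)
{
    return J->RULVTime >= P->ResourceLimitMaxViolationTime;
}
```

With the values from my log (RULVTime 45426, limit 300) the check trips at once. After a reset at job start, 21 seconds of violation would stay under the limit.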
----------------------------------------
config:
whiteout:/var/spool/maui/log # /apps/maui/bin/showconfig -v | grep RESOURCELIMITPOLICY
RESOURCELIMITPOLICY[0] PROC:EXTENDEDVIOLATION:CANCEL:00:05:00 MEM:ALWAYS:CANCEL
whiteout:/var/spool/maui/log #
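
The 00:05:00 in the PROC policy is where the 300 in the log line comes from. A small sketch of that conversion follows; hms_to_seconds is a hypothetical helper written for this post, not Maui's own parser.

```c
#include <stdio.h>

/* Convert an "HH:MM:SS" duration to seconds.
 * Returns -1 if the string does not have that form. */
long hms_to_seconds(const char *s)
{
    int h, m, sec;

    if (sscanf(s, "%d:%d:%d", &h, &m, &sec) != 3)
        return -1;
    if (h < 0 || m < 0 || m > 59 || sec < 0 || sec > 59)
        return -1;

    return h * 3600L + m * 60L + sec;
}
```

Calling hms_to_seconds("00:05:00") gives 300, which matches the value printed for P->ResourceLimitMaxViolationTime[VRes].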
--------------------------------------
whiteout:~ # tracejob -n 2 4660
Job: 4660.whiteout.sf.utas.edu.au
04/12/2007 14:46:59 S enqueuing into batch, state 1 hop 1
04/12/2007 14:46:59 S Job Queued at request of
[EMAIL PROTECTED], owner =
[EMAIL PROTECTED], job name =
species3kdFREQgood, queue = batch
04/12/2007 14:46:59 A queue=batch
04/12/2007 19:56:40 M Job Modified at request of
[EMAIL PROTECTED]
04/12/2007 19:56:40 S Job Modified at request of
[EMAIL PROTECTED]
04/12/2007 19:56:40 S Job Run at request of [EMAIL PROTECTED]
04/12/2007 19:56:40 S Job Modified at request of
[EMAIL PROTECTED]
04/12/2007 19:56:40 A user=USERNAME group=users
jobname=species3kdFREQgood
queue=batch ctime=1176353219 qtime=1176353219
etime=1176353219 start=1176371800
exec_host=whiteout
Resource_List.mem=2000mb Resource_List.ncpus=1
Resource_List.neednodes=whiteout
Resource_List.nodect=1
Resource_List.walltime=20:00:00
04/12/2007 19:57:01 S Job deleted at request of
[EMAIL PROTECTED]
04/12/2007 19:57:01 S Job sent signal
SIGTERM on delete
04/12/2007 19:57:01 M kill_task: killing pid 19086 task 1 with sig 15
04/12/2007 19:57:01 M kill_task: killing pid 19169 task 1 with sig 15
04/12/2007 19:57:01 M kill_task: killing pid 19213 task 1 with sig 15
04/12/2007 19:57:01 M kill_task: killing pid 19214 task 1 with sig 15
04/12/2007 19:57:01 M kill_task: killing pid 19227 task 1 with sig 15
04/12/2007 19:57:01 A [EMAIL PROTECTED]
04/12/2007 19:57:02 S Exit_status=143 resources_used.cput=00:00:14
resources_used.mem=41120kb
resources_used.vmem=1816576kb
resources_used.walltime=00:00:21
04/12/2007 19:57:02 M scan_for_terminated: job
4660.whiteout.sf.utas.edu.au
task 1 terminated, sid 19086
04/12/2007 19:57:02 M job was terminated
04/12/2007 19:57:02 A user=USERNAME group=users
jobname=species3kdFREQgood
queue=batch ctime=1176353219 qtime=1176353219
etime=1176353219 start=1176371800
exec_host=whiteout
Resource_List.mem=2000mb Resource_List.ncpus=1
Resource_List.neednodes=batch
Resource_List.nodect=1
Resource_List.walltime=20:00:00 session=19086
end=1176371822 Exit_status=143
resources_used.cput=00:00:14
resources_used.mem=41120kb
resources_used.vmem=1816576kb
resources_used.walltime=00:00:21
04/12/2007 19:57:08 S dequeuing from batch, state COMPLETE
whiteout:~ #
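
For reference, the qtime, start and end values in the accounting record above give the queue wait and the run time directly. The sketch below does only that arithmetic; the macro and function names are mine, and the timestamps are copied from the tracejob output.

```c
#include <assert.h>
#include <stdio.h>

/* Epoch timestamps copied from the tracejob accounting record. */
#define QTIME 1176353219L   /* job queued  */
#define START 1176371800L   /* job started */
#define END   1176371822L   /* job ended   */

/* Seconds the job waited in the queue before it started. */
long queue_wait_seconds(void)
{
    return START - QTIME;
}

/* Seconds the job actually ran before it was cancelled. */
long run_seconds(void)
{
    return END - START;
}

/* Format a non-negative number of seconds as HH:MM:SS into buf. */
void format_hms(long secs, char *buf, size_t len)
{
    assert(secs >= 0);
    snprintf(buf, len, "%02ld:%02ld:%02ld",
             secs / 3600, (secs % 3600) / 60, secs % 60);
}
```

The job waited 18581 seconds (05:09:41) in the queue and ran for 22 seconds, so its run time was well under the 300-second limit when it was cancelled.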
-----------------------------------
whiteout:/var/spool/maui/log # grep 4660 maui.log.1
.....
.....
04/12 19:56:33 MJobPReserve(4660,DEFAULT,ResCount,ResCountRej)
04/12 19:56:40 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:56:40 INFO: 26 feasible tasks found for job 4660:0 in
partition DEFAULT (1 Needed)
04/12 19:56:40 INFO: tasks located for job 4660: 1 of 1 required (2
feasible)
04/12 19:56:40 MJobStart(4660)
04/12 19:56:40 MRMJobStart(4660,Msg,SC)
04/12 19:56:40 MPBSJobStart(4660,WHITEOUT.SF.UTAS.EDU.AU,Msg,SC)
04/12 19:56:40 MPBSJobModify(4660,Resource_List,Resource,whiteout)
04/12 19:56:40 MPBSJobModify(4660,Resource_List,Resource,batch)
04/12 19:56:40 INFO: job '4660' successfully started
04/12 19:56:40 INFO: starting job '4660'
04/12 19:56:45 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:01 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:01 MSysRegEvent(For job '4660' , is J->RULVTime (45426) <
P->ResourceLimitMaxViolationTime[VRes] (300) ? ,0,0,1)
04/12 19:57:01 MSysRegEvent(JOBRESVIOLATION: job '4660' in state
'Running' has exceeded PROC resource limit (200 > 100) (action CANCEL
will be taken) job start time: Thu Apr 12 19:56:40
04/12 19:57:01 MRMJobCancel(4660,job violates resource utilization
policies,SC)
04/12 19:57:01 MPBSJobCancel(4660,WHITEOUT.SF.UTAS.EDU.AU,CMsg,Msg,job
violates resource utilization policies)
04/12 19:57:01 INFO: job '4660' successfully cancelled
04/12 19:57:04 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:04 INFO: job '4660' changed states from Running to Completed
04/12 19:57:04 MJobProcessCompleted(4660)
04/12 19:57:04 INFO: job '4660' completed X: 0.258361 T: 21 PS:
21 A: 0.000292
04/12 19:57:04 MJobSendFB(4660)
04/12 19:57:04 INFO: job usage sent for job '4660'
04/12 19:57:04 MJobRemove(4660)
04/12 19:57:04 MJobDestroy(4660)
whiteout:/var/spool/maui/log #
- Nick
--
Nick Sonneveld | [EMAIL PROTECTED]
IT Resources, University of Tasmania, Private Bag 69, Hobart Tas 7001
(03) 6226 6377 | 0407 336 309 | Fax (03) 6226 7171
_______________________________________________
mauiusers mailing list
[EMAIL PROTECTED]
http://www.supercluster.org/mailman/listinfo/mauiusers