Hi guys,

I think I've found a bug in Maui.  Is this the right place to post?

Maui does not wait for the full extended violation time if the job has been idle in the queue for a while. If the job starts violating a resource limit as soon as it starts running, it is killed immediately instead of after the violation time has elapsed. This does not happen if the job has not been waiting in the queue for long.

The problem is that the check against MaxViolationTime doesn't take into account the time the job spent in the queue.

To find this out, I inserted a line into Maui to print out J->RULVTime and P->ResourceLimitMaxViolationTime[VRes]:

04/12 19:57:01 MSysRegEvent(For job '4660' , is J->RULVTime (45426) < P->ResourceLimitMaxViolationTime[VRes] (300) ? ,0,0,1)

J->RULVTime was a very large number despite the fact that the job had only just started.

Suggested fix: reset J->RULVTime at the point where the job starts?
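
To make the suggestion concrete, here is a minimal sketch of the idea. It is not a patch against the real source: job_t, LastCheckTime, job_on_start() and job_check_violation() are names made up for illustration, and I am assuming the violation timer is accumulated between limit checks. Only RULVTime and the "less than the max violation time means ignore" comparison come from the actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for Maui's job structure.
 * Only RULVTime is a field name taken from the real source. */
typedef struct {
  long RULVTime;       /* accumulated resource-limit violation time (s) */
  long LastCheckTime;  /* hypothetical: epoch time of the previous check */
} job_t;

/* Hypothetical hook, called when the job transitions to Running:
 * discard whatever was accumulated before the job started. */
void job_on_start(job_t *J, long now) {
  assert(J != NULL);
  J->RULVTime = 0;
  J->LastCheckTime = now;
}

/* Hypothetical per-check update. Returns 1 if the violation has lasted
 * at least max_violation_time seconds and should be acted on, else 0. */
int job_check_violation(job_t *J, long now, int violating,
                        long max_violation_time) {
  assert(J != NULL);
  if (violating) {
    J->RULVTime += now - J->LastCheckTime;  /* extend current violation */
  } else {
    J->RULVTime = 0;                        /* violation ended, start over */
  }
  J->LastCheckTime = now;
  /* Same sense as the existing test: below the limit means "ignore". */
  return violating && J->RULVTime >= max_violation_time;
}
```

Starting from the stale value 45426 seen in my log, calling job_on_start() first means a check 21 seconds into the run is ignored, and a check 321 seconds in triggers the action.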

Using maui-3.2.6p19

--------------------
section of code:

MLimit.c, line 296

      case mrlpExtendedViolation:

        /* determine length of violation */

        if (J->RULVTime < P->ResourceLimitMaxViolationTime[VRes])
          {
          /* ignore violation */

          ResourceLimitsExceeded = FALSE;
          }

        break;

----------------------------------------

config:

whiteout:/var/spool/maui/log # /apps/maui/bin/showconfig -v | grep RESOURCELIMITPOLICY
RESOURCELIMITPOLICY[0] PROC:EXTENDEDVIOLATION:CANCEL:00:05:00 MEM:ALWAYS:CANCEL
whiteout:/var/spool/maui/log #
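
For reference, the 00:05:00 in the policy is the same 300 seconds that shows up as ResourceLimitMaxViolationTime in my debug line. A small sketch of that conversion follows; hms_to_seconds() is my own helper for illustration, not a Maui function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of Maui): convert "HH:MM:SS" to seconds.
 * Returns -1 if the string is not exactly three colon-separated numbers
 * or if minutes/seconds are out of range. */
long hms_to_seconds(const char *s) {
  int h, m, sec;
  char extra;

  assert(s != NULL);
  /* The trailing %c only matches if there is junk after the seconds,
   * so a clean "HH:MM:SS" yields exactly 3 conversions. */
  if (sscanf(s, "%d:%d:%d%c", &h, &m, &sec, &extra) != 3)
    return -1;
  if (h < 0 || m < 0 || m > 59 || sec < 0 || sec > 59)
    return -1;
  return (long)h * 3600 + (long)m * 60 + sec;
}
```

hms_to_seconds("00:05:00") gives 300, so the job should have been allowed five minutes over the PROC limit before being cancelled.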



--------------------------------------

whiteout:~ # tracejob -n 2 4660

Job: 4660.whiteout.sf.utas.edu.au

04/12/2007 14:46:59  S    enqueuing into batch, state 1 hop 1
04/12/2007 14:46:59  S    Job Queued at request of
                          [EMAIL PROTECTED], owner =
                          [EMAIL PROTECTED], job name =
                          species3kdFREQgood, queue = batch
04/12/2007 14:46:59  A    queue=batch
04/12/2007 19:56:40  M    Job Modified at request of
                          [EMAIL PROTECTED]
04/12/2007 19:56:40  S    Job Modified at request of
                          [EMAIL PROTECTED]
04/12/2007 19:56:40  S    Job Run at request of [EMAIL PROTECTED]
04/12/2007 19:56:40  S    Job Modified at request of
                          [EMAIL PROTECTED]
04/12/2007 19:56:40  A    user=USERNAME group=users jobname=species3kdFREQgood
                          queue=batch ctime=1176353219 qtime=1176353219
                          etime=1176353219 start=1176371800 exec_host=whiteout
                          Resource_List.mem=2000mb Resource_List.ncpus=1
                          Resource_List.neednodes=whiteout
                          Resource_List.nodect=1 Resource_List.walltime=20:00:00
04/12/2007 19:57:01  S    Job deleted at request of [EMAIL PROTECTED]
04/12/2007 19:57:01  S    Job sent signal SIGTERM on delete
04/12/2007 19:57:01  M    kill_task: killing pid 19086 task 1 with sig 15
04/12/2007 19:57:01  M    kill_task: killing pid 19169 task 1 with sig 15
04/12/2007 19:57:01  M    kill_task: killing pid 19213 task 1 with sig 15
04/12/2007 19:57:01  M    kill_task: killing pid 19214 task 1 with sig 15
04/12/2007 19:57:01  M    kill_task: killing pid 19227 task 1 with sig 15
04/12/2007 19:57:01  A    [EMAIL PROTECTED]
04/12/2007 19:57:02  S    Exit_status=143 resources_used.cput=00:00:14
                          resources_used.mem=41120kb
                          resources_used.vmem=1816576kb
                          resources_used.walltime=00:00:21
04/12/2007 19:57:02  M    scan_for_terminated: job 4660.whiteout.sf.utas.edu.au
                          task 1 terminated, sid 19086
04/12/2007 19:57:02  M    job was terminated
04/12/2007 19:57:02  A    user=USERNAME group=users jobname=species3kdFREQgood
                          queue=batch ctime=1176353219 qtime=1176353219
                          etime=1176353219 start=1176371800 exec_host=whiteout
                          Resource_List.mem=2000mb Resource_List.ncpus=1
                          Resource_List.neednodes=batch Resource_List.nodect=1
                          Resource_List.walltime=20:00:00 session=19086
                          end=1176371822 Exit_status=143
                          resources_used.cput=00:00:14
                          resources_used.mem=41120kb
                          resources_used.vmem=1816576kb
                          resources_used.walltime=00:00:21
04/12/2007 19:57:08  S    dequeuing from batch, state COMPLETE
whiteout:~ #
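
To put numbers on the timeline, the epoch values in the accounting records above can be subtracted directly. This is just arithmetic on the values from the log; the function and constant names are mine.

```c
#include <assert.h>

/* Epoch timestamps copied from the accounting ('A') records above. */
static const long JOB_QTIME = 1176353219L;  /* qtime: job entered queue */
static const long JOB_START = 1176371800L;  /* start: job began running */
static const long JOB_END   = 1176371822L;  /* end: job was terminated  */

/* Seconds between two epoch timestamps; 'from' must not be after 'to'. */
long elapsed_seconds(long from, long to) {
  assert(from <= to);
  return to - from;
}

/* Time the job sat idle in the queue before starting. */
long queue_wait_seconds(void) {
  return elapsed_seconds(JOB_QTIME, JOB_START);
}

/* Time the job actually ran before it was killed. */
long run_seconds(void) {
  return elapsed_seconds(JOB_START, JOB_END);
}
```

That works out to 18581 seconds (5h 09m 41s) waiting in the queue and 22 seconds of run time, so the job was cancelled well inside the 300 second allowance. Neither figure equals the logged J->RULVTime of 45426, so I can't say exactly what that value was counting from, only that it is far larger than the run time.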

-----------------------------------

whiteout:/var/spool/maui/log # grep 4660 maui.log.1
.....
.....
04/12 19:56:33 MJobPReserve(4660,DEFAULT,ResCount,ResCountRej)
04/12 19:56:40 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:56:40 INFO: 26 feasible tasks found for job 4660:0 in partition DEFAULT (1 Needed)
04/12 19:56:40 INFO: tasks located for job 4660: 1 of 1 required (2 feasible)
04/12 19:56:40 MJobStart(4660)
04/12 19:56:40 MRMJobStart(4660,Msg,SC)
04/12 19:56:40 MPBSJobStart(4660,WHITEOUT.SF.UTAS.EDU.AU,Msg,SC)
04/12 19:56:40 MPBSJobModify(4660,Resource_List,Resource,whiteout)
04/12 19:56:40 MPBSJobModify(4660,Resource_List,Resource,batch)
04/12 19:56:40 INFO:     job '4660' successfully started
04/12 19:56:40 INFO:     starting job '4660'
04/12 19:56:45 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:01 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:01 MSysRegEvent(For job '4660' , is J->RULVTime (45426) < P->ResourceLimitMaxViolationTime[VRes] (300) ? ,0,0,1)
04/12 19:57:01 MSysRegEvent(JOBRESVIOLATION: job '4660' in state 'Running' has exceeded PROC resource limit (200 > 100) (action CANCEL will be taken) job start time: Thu Apr 12 19:56:40
04/12 19:57:01 MRMJobCancel(4660,job violates resource utilization policies,SC)
04/12 19:57:01 MPBSJobCancel(4660,WHITEOUT.SF.UTAS.EDU.AU,CMsg,Msg,job violates resource utilization policies)
04/12 19:57:01 INFO:     job '4660' successfully cancelled
04/12 19:57:04 MPBSJobUpdate(4660,4660.whiteout.sf.utas.edu.au,TaskList,0)
04/12 19:57:04 INFO:     job '4660' changed states from Running to Completed
04/12 19:57:04 MJobProcessCompleted(4660)
04/12 19:57:04 INFO: job '4660' completed X: 0.258361 T: 21 PS: 21 A: 0.000292
04/12 19:57:04 MJobSendFB(4660)
04/12 19:57:04 INFO:     job usage sent for job '4660'
04/12 19:57:04 MJobRemove(4660)
04/12 19:57:04 MJobDestroy(4660)
whiteout:/var/spool/maui/log #







- Nick

--
Nick Sonneveld  |  [EMAIL PROTECTED]
IT Resources, University of Tasmania, Private Bag 69, Hobart Tas 7001
(03) 6226 6377  |  0407 336 309  |  Fax (03) 6226 7171
_______________________________________________
mauiusers mailing list
[EMAIL PROTECTED]
http://www.supercluster.org/mailman/listinfo/mauiusers
