Hi guys,
I think I found the problem. I had used 'changeparam' when I really should
have restarted the maui process. After I restarted the scheduler, it no
longer seemed to kill jobs before the violation time limit had passed.
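For reference, a sketch of the same setting written persistently in maui.cfg, assuming the usual space-separated list form of RESOURCELIMITPOLICY (the file's location depends on the installation). As described above, a change made this way is picked up only once the maui daemon is restarted.

```
# maui.cfg (location is installation-specific; assumed here)
# Allow PROC overruns for up to 15 minutes, cancel MEM overruns at once.
RESOURCELIMITPOLICY PROC:EXTENDEDVIOLATION:CANCEL:00:15:00 MEM:ALWAYS:CANCEL
```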
- Nick
Nick Sonneveld wrote:
Hullo,
I'm running Maui 3.2.6p19-snap.1169758944 and I'm having trouble getting
it to allow resource overruns for a short time.
Current settings:
whiteout:/var/spool/maui # /apps/maui/bin/showconfig -v | grep RESOURCELIMITPOLICY
RESOURCELIMITPOLICY[0] PROC:EXTENDEDVIOLATION:CANCEL:00:15:00 MEM:ALWAYS:CANCEL
whiteout:/var/spool/maui #
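Each policy entry packs several fields into one colon-separated token. A small sketch, assuming the field order RESOURCE:POLICY:ACTION[:VIOLATIONTIME] given in the Maui admin documentation, shows how the two entries above break down. `parse_policy` is a hypothetical helper for illustration, not a Maui tool.

```shell
#!/bin/sh
# Split one RESOURCELIMITPOLICY entry into its fields, assuming the
# layout RESOURCE:POLICY:ACTION[:VIOLATIONTIME].
# VIOLATIONTIME (HH:MM:SS) contains colons itself, so only the first
# three colons are treated as field separators.
parse_policy() {
    entry=$1
    resource=${entry%%:*}; rest=${entry#*:}
    policy=${rest%%:*};    rest=${rest#*:}
    action=${rest%%:*}
    # Anything after ACTION is the violation time (optional).
    case $rest in
        *:*) vtime=${rest#*:} ;;
        *)   vtime="" ;;
    esac
    printf 'resource=%s policy=%s action=%s violationtime=%s\n' \
        "$resource" "$policy" "$action" "${vtime:-none}"
}

for e in PROC:EXTENDEDVIOLATION:CANCEL:00:15:00 MEM:ALWAYS:CANCEL; do
    parse_policy "$e"
done
```

Read this way, only the PROC entry carries a 15-minute grace period; MEM overruns are cancelled immediately.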
However, looking at the logs today, I saw:
whiteout:/var/spool/maui/log # grep -i 'violation' maui.log
03/07 11:36:25 MSysRegEvent(JOBRESVIOLATION: job '3648' in state 'Running' has exceeded PROC resource limit (141 > 100) (action CANCEL will be taken) job start time: Wed Mar 7 11:35:32
03/07 11:36:25 ALERT: limit violation action CANCEL succeeded
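When a log holds many of these events, a short filter can summarize each one. This is a sketch only: the sed pattern is fitted to the single JOBRESVIOLATION line quoted above and is an assumption; other Maui versions may word the message differently. `summarize_violation` is a hypothetical name.

```shell
#!/bin/sh
# Extract job ID, resource, usage and limit from Maui
# JOBRESVIOLATION log lines read on stdin. The message wording is
# assumed to match the sample line below.
summarize_violation() {
    sed -n "s/.*job '\([^']*\)'.* exceeded \([A-Z]*\) resource limit (\([0-9]*\) > \([0-9]*\)).*/job=\1 resource=\2 used=\3 limit=\4/p"
}

# Sample line copied from the maui.log excerpt above.
line="03/07 11:36:25 MSysRegEvent(JOBRESVIOLATION: job '3648' in state 'Running' has exceeded PROC resource limit (141 > 100) (action CANCEL will be taken) job start time: Wed Mar 7 11:35:32"
printf '%s\n' "$line" | summarize_violation
```

On a live system the same function would be fed with something like `grep JOBRESVIOLATION maui.log | summarize_violation`; for the sample it prints `job=3648 resource=PROC used=141 limit=100`.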
and
whiteout:/var/spool/maui/log # tracejob 3648
Job: 3648.whiteout.sf.utas.edu.au
03/07/2007 00:40:01 S enqueuing into batch, state 1 hop 1
03/07/2007 00:40:01 S Job Queued at request of [EMAIL PROTECTED], owner = [EMAIL PROTECTED], job name = Test2_4C, queue = batch
03/07/2007 00:40:01 A queue=batch
03/07/2007 11:35:32 S Job Modified at request of [EMAIL PROTECTED]
03/07/2007 11:35:32 S Job Run at request of [EMAIL PROTECTED]
03/07/2007 11:35:33 M Job Modified at request of [EMAIL PROTECTED]
03/07/2007 11:35:33 S Job Modified at request of [EMAIL PROTECTED]
03/07/2007 11:35:33 A user=prachab group=users jobname=Test2_4C queue=batch
                      ctime=1173188401 qtime=1173188401 etime=1173188401
                      start=1173227733 exec_host=whiteout
                      Resource_List.mem=2000mb Resource_List.ncpus=1
                      Resource_List.neednodes=whiteout Resource_List.nodect=1
                      Resource_List.walltime=1000:00:00
03/07/2007 11:36:25 S Job deleted at request of [EMAIL PROTECTED]
03/07/2007 11:36:25 S Job sent signal SIGTERM on delete
03/07/2007 11:36:25 M kill_task: killing pid 32547 task 1 with sig 15
03/07/2007 11:36:25 M kill_task: killing pid 32569 task 1 with sig 15
03/07/2007 11:36:25 M kill_task: killing pid 32574 task 1 with sig 15
03/07/2007 11:36:25 M kill_task: killing pid 32615 task 1 with sig 15
03/07/2007 11:36:25 A [EMAIL PROTECTED]
03/07/2007 11:36:28 S Exit_status=143 resources_used.cput=00:00:46
                      resources_used.mem=300784kb resources_used.vmem=341792kb
                      resources_used.walltime=00:00:52
03/07/2007 11:36:28 M kill_task: killing pid 32615 task 1 with sig 9
03/07/2007 11:36:28 M scan_for_terminated: job 3648.whiteout.sf.utas.edu.au task 1 terminated, sid 32547
03/07/2007 11:36:28 M job was terminated
03/07/2007 11:36:28 A user=prachab group=users jobname=Test2_4C queue=batch
                      ctime=1173188401 qtime=1173188401 etime=1173188401
                      start=1173227733 exec_host=whiteout
                      Resource_List.mem=2000mb Resource_List.ncpus=1
                      Resource_List.neednodes=batch Resource_List.nodect=1
                      Resource_List.walltime=1000:00:00 session=32547
                      end=1173227788 Exit_status=143
                      resources_used.cput=00:00:46 resources_used.mem=300784kb
                      resources_used.vmem=341792kb resources_used.walltime=00:00:52
03/07/2007 11:36:37 S dequeuing from batch, state COMPLETE
It looks like Maui didn't wait the full 15 minutes before killing the job:
it started running at 11:35:32 and was cancelled at 11:36:25, less than a
minute later. Is there something wrong with my config?
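The arithmetic behind that suspicion can be checked directly from the tracejob output. A minimal sketch, using the start= and end= epoch values from the final accounting (A) record and the 00:15:00 violation time from RESOURCELIMITPOLICY; `hms_to_seconds` is a hypothetical helper.

```shell
#!/bin/sh
# Compare how long job 3648 actually ran with the 15-minute
# EXTENDEDVIOLATION grace period from RESOURCELIMITPOLICY.

# Convert an HH:MM:SS duration to seconds. awk is used so that
# fields with leading zeros (e.g. "08") are read as decimal.
hms_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

start=1173227733      # start= from the accounting record
end=1173227788        # end= from the final accounting record
grace=$(hms_to_seconds 00:15:00)

ran=$(( end - start ))
echo "job ran ${ran}s; grace period ${grace}s"
if [ "$ran" -lt "$grace" ]; then
    echo "cancelled before the grace period expired"
fi
```

This prints `job ran 55s; grace period 900s` followed by `cancelled before the grace period expired`, consistent with the running scheduler not yet honouring the 15-minute violation time.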
- Nick
--
Nick Sonneveld | [EMAIL PROTECTED]
IT Resources, University of Tasmania, Private Bag 69, Hobart Tas 7001
(03) 6226 6377 | 0407 336 309 | Fax (03) 6226 7171
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers