Dear list,

I'm trying to limit the number of processors a job may use, so that a job which uses more cores than it requested at submit time is cancelled, preferably after a grace period. According to the manual, the right configuration would be

    RESOURCELIMITPOLICY PROC:EXTENDEDVIOLATION:CANCEL:00:05:00

which monitors the actual load and should cancel a job if a violation lasts longer than 5 minutes.

The problem is that it kills any job whose load exceeds 1, even if the job requested several cores at submit time (it also doesn't wait 5 minutes before doing so, but that's a separate issue).

For example, say I submit a job with

    -l nodes=1:ppn=4,mem=2000m

which uses 4 cores. It is soon killed, with the following message in the logs:

    job '41975' in state 'Running' has exceeded PROC resource limit (394 > 100) (action CANCEL will be taken)

The command 'diagnose -j' says:

    Name  State   Par Proc QOS WCLimit     R Min User Group   Account QueuedTime Network Opsys  Arch   Mem Disk Procs Class       Features
    41975 Running DEF 4    DEF 99:23:59:59 1 4   user uniuser -       00:01:35   [NONE]  [NONE] [NONE] >=0 >=0  NC0   [default:1] [NONE]

    WARNING: job '41975' utilizes more procs than dedicated (3.94 > 1)

Note that 'Proc' is '4', as it should be, yet maui claims that only one processor is dedicated.

'checkjob -v 41975' says:

    ...
    Req[0] TaskCount: 4 Partition: DEFAULT
    Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
    Opsys: [NONE] Arch: [NONE] Features: [NONE]
    Exec: '' ExecSize: 0 ImageSize: 0
    Dedicated Resources Per Task: PROCS: 1 MEM: 500M
    Utilized Resources Per Task: PROCS: 3.94 MEM: 1.15 SWAP: 5.87
    Avg Util Resources Per Task: PROCS: 3.94
    Max Util Resources Per Task: PROCS: 3.94 MEM: 1.15 SWAP: 5.87
    Average Utilized Memory: 664.63 MB
    Average Utilized Procs: 10.48
    NodeAccess: SHARED
    TasksPerNode: 4 NodeCount: 1
    Allocated Nodes:
    ...
    Reservation '41975' (-00:01:33 -> 99:23:58:26 Duration: 99:23:59:59)
    PE: 4.00 StartPriority: 19821

What seems to be happening is that the requested resources (4 cores, 2000m memory) are divided equally among 4 tasks with 1 core and 500M of memory each. The four processes that generate the load of 3.94 are for some reason attributed to a single task rather than spread over all 4, and this 3.94 > 1 'violation' triggers the cancellation of the job. Any idea how to make this work?
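
To make the suspected arithmetic concrete, here is a small Python sketch. It is only an illustration of the numbers quoted above, not Maui's actual code; the function names are made up, and the assumption that the limit is checked as a percentage of one task's share is inferred from the "(394 > 100)" log line.

```python
# Illustration only: this is NOT Maui's code, just the arithmetic suspected above.
# All function names are hypothetical.

def per_task_share(total_procs, total_mem_mb, task_count):
    """Split the requested resources evenly across the job's tasks."""
    return total_procs / task_count, total_mem_mb / task_count


def usage_percent(utilized_procs, dedicated_procs):
    """Utilized processors as a percentage of the dedicated processors."""
    return round(100 * utilized_procs / dedicated_procs)


# Request from the example: -l nodes=1:ppn=4,mem=2000m, reported as TaskCount: 4
procs_per_task, mem_per_task = per_task_share(4, 2000, 4)

utilized = 3.94  # "Utilized Resources Per Task: PROCS" from checkjob

# Measured against a single task's share, as the log line suggests
per_task_pct = usage_percent(utilized, procs_per_task)

# Measured against everything the job requested
whole_job_pct = usage_percent(utilized, 4)

print(f"dedicated per task: {procs_per_task:g} procs, {mem_per_task:g} MB")
print(f"against one task:   {per_task_pct}% (limit 100%)")
print(f"against whole job:  {whole_job_pct}% (limit 100%)")
```

Measured against one task's share, the usage comes out at 394%, the same figure as in the log message; measured against all 4 requested cores, it stays below 100% and would not count as a violation.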
Is there a way to apply the trigger to all tasks together rather than to just one?

We are using maui-3.2.6p19.

Regards,
Lech
