I'm using maui+torque combo for years by now, but today I've meet some
miracle.
My current setup is
maui-3.2.6p21-1234905291
torque-2.4.0b1
And it used to work last month or so, but today stopped.
Maui creates reservations to start jobs but doesn't even try to really
start them.
I will try to show latter what happens with job 7529.
# showres
------------------------------------------------------------------------------------
ReservationID Type S Start End Duration N/
P StartTime
7354 Job R -2:12:24:57 11:35:03 3:00:00:00
4/16 Tue May 12 16:24:08
7365 Job R -2:02:15:39 21:44:21 3:00:00:00
4/16 Wed May 13 02:33:26
7374 Job R -1:07:41:39 1:16:18:21 3:00:00:00
8/16 Wed May 13 21:07:26
7377 Job R -1:02:19:23 1:21:40:37 3:00:00:00
1/16 Thu May 14 02:29:42
7392 Job R -4:47:42 5:12:18 10:00:00
1/16 Fri May 15 00:01:23
7393 Job R -4:27:50 5:32:10 10:00:00
1/16 Fri May 15 00:21:15
7394 Job R -4:23:46 5:36:14 10:00:00
1/16 Fri May 15 00:25:19
7397 Job I 5:12:18 15:12:18 10:00:00
1/16 Fri May 15 10:01:23
7529 Job I 00:00:00 3:00:00:00 3:00:00:00
1/16 Fri May 15 04:49:05
7561 Job R -4:18:30 2:19:41:30 3:00:00:00
1/16 Fri May 15 00:30:35
7567 Job R -00:47:41 1:12:19 2:00:00
1/1 Fri May 15 04:01:24
------------------------------------------------------------------------------------
As we can see, reservation is created, but job remain idle.
# showres -n
------------------------------------------------------------------------------------
node035 Job 7529 Idle 16 00:00:00
3:00:00:00 Fri May 15 05:04:48
------------------------------------------------------------------------------------
I skipped rest of nodes here, but we see that reservation made for
node035.
# checkjob 7529
------------------------------------------------------------------------------------
checking job 7529
State: Idle
Creds: user:user group:group class:vasp qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Thu May 14 23:35:15
(Time Queued Total: 5:18:18 Eligible: 5:16:08)
Total Tasks: 16
Req[0] TaskCount: 16 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 1856M
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [D1][Q1][Q2][huge]
Flags: RESTARTABLE
Reservation '7529' (00:00:00 -> 3:00:00:00 Duration: 3:00:00:00)
PE: 16.00 StartPriority: 396
job cannot run in partition D1 (insufficient idle procs available: 0 <
16)
cannot select job 7529 for partition D2 (PartitionAccess)
job cannot run in partition Q1 (insufficient idle procs available: 0 <
16)
job cannot run in partition Q2 (insufficient idle procs available: 0 <
16)
cannot select job 7529 for partition check (PartitionAccess)
cannot select job 7529 for partition test (PartitionAccess)
job can run in partition huge (16 procs available. 16 procs required)
------------------------------------------------------------------------------------
So, here it is OK.
# diagnose -j 7529
------------------------------------------------------------------------------------
Name State Par Proc QOS WCLimit R Min
User Group Account QueuedTime Network Opsys Arch Mem
Disk Procs Class Features
7529 Idle ALL 16 DEF 3:00:00:00 1 16 user
group - 5:17:58 [NONE] [NONE] [NONE] >=0 >=0
NC0 [vasp:1] [NONE]
------------------------------------------------------------------------------------
And parts of log related to this job (collected after maui restart):
------------------------------------------------------------------------------------
05/15 04:58:24 MJobFind('7529',J,0)
05/15 04:58:24 MJobCreate(7529,TRUE,JP)
05/15 04:58:24 MRMJobPreLoad(J,7529,0)
05/15 04:58:24 MPBSJobLoad(7529,7529.master,J,TaskList,0)
05/15 04:58:24 MReqCreate(7529,SrcRQ,DstRQ,DoCreate)
05/15 04:58:24 INFO: adding requirement at slot 0
05/15 04:58:24 __MPBSGetTaskList(7529,16,NULL,0)
05/15 04:58:24 INFO: processing node request line '16'
05/15 04:58:24 __MPBSGetTaskList(7529,16,NULL,0)
05/15 04:58:24 MJobSetCreds(7529,user,group,)
05/15 04:58:24 INFO: default QOS for job 7529 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
05/15 04:58:24 MRMJobPostLoad(7529,TaskList,MASTER)
05/15 04:58:24 MCPRestore(JOB,7529,Optr)
05/15 04:58:24 MJobLoadCP(JS,JOB 7529 1242356159
<job AWDuration="0" EEWDuration="19086" Hold="0"
ReqSMinTime="00:00:00" StartCou
05/15 04:58:24 MUGetIndex(AWDuration,ValList,0)
05/15 04:58:24 MUGetIndex(EEWDuration,ValList,0)
05/15 04:58:24 MUGetIndex(Hold,ValList,0)
05/15 04:58:24 MUGetIndex(ReqSMinTime,ValList,0)
05/15 04:58:24 MUTimeFromString(00:00:00)
05/15 04:58:24 INFO: string '00:00:00' -> 0
05/15 04:58:24 MUGetIndex(StartCount,ValList,0)
05/15 04:58:24 MUGetIndex(StatMSUtl,ValList,0)
05/15 04:58:24 MUGetIndex(StatPSDed,ValList,0)
05/15 04:58:24 MUGetIndex(StatPSUtl,ValList,0)
05/15 04:58:24 MUGetIndex(SuspendDuration,ValList,0)
05/15 04:58:24 MUGetIndex(SysPrio,ValList,0)
05/15 04:58:24 MUGetIndex(SysSMinTime,ValList,0)
05/15 04:58:24 MUTimeFromString(00:00:00)
05/15 04:58:24 INFO: string '00:00:00' -> 0
05/15 04:58:24 INFO: system min start time set on job for -
INFINITY
05/15 04:58:24 INFO: default QOS for job 7529 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
05/15 04:58:24 MUNLFromTL(NL,TL)
05/15 04:58:24 INFO: default QOS for job 7529 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
05/15 04:58:24 INFO: job '7529' loaded: 16 user group
259200 Idle 0 1242336915 [NONE] [NONE] [NONE] >= 0
>= 0 [NONE] 12
------------------------------------------------------------------------------------
05/15 04:58:24 MJobCheckPolicies(7529,HARD,2,ALL,RIndex,NULL,2140000000)
05/15 04:58:24 INFO: job '7529' added to queue at slot 73
------------------------------------------------------------------------------------
05/15 04:58:24 INFO: job 7529 not considered for spanning
------------------------------------------------------------------------------------
05/15 04:58:24 MReqCheckResourceMatch(7529,0,node035,NULL)
05/15 04:58:24 INFO: node node035 can provide resources for job
7529:0
05/15 04:58:24 MJobCheckNRes(7529,node035,RQ[0], INFINITY,TCAvail,
1.000,RIndex,NULL,FeasCheck)
05/15 04:58:24 MReqCheckResourceMatch(7529,0,node035,RIndex)
05/15 04:58:24 INFO: node node035 can provide resources for job
7529:0
------------------------------------------------------------------------------------
05/15 04:58:24 INFO: node node035 supports 16 tasks of job 7529:0
for INFINITY of 3:00:00:00 (no reservation)
------------------------------------------------------------------------------------
05/15 04:58:24 INFO: 96 feasible tasks found for job 7529:0 in
partition huge (16 Needed)
05/15 04:58:24 INFO: C[00] S: 1242392150 E: 2139740800 T: 16
N: 1
05/15 04:58:24 INFO: C[00] S: 1242375675 E: 1242392150 T: 16
N: 1
05/15 04:58:24 INFO: C[01] S: 1242392150 E: 2139740800 T: 32
N: 2
05/15 04:58:24 INFO: node node035 supports 16 tasks of job 7529:0
for INFINITY of 3:00:00:00 (no reservation)
05/15 04:58:24 INFO: C[00] S: 1242356304 E: 1242375675 T: 16
N: 1
05/15 04:58:24 INFO: C[01] S: 1242375675 E: 1242392150 T: 32
N: 2
05/15 04:58:24 INFO: C[02] S: 1242392150 E: 2139740800 T: 48
N: 3
05/15 04:58:24 INFO: C[00] S: 1242356304 E: 1242375675 T: 16
N: 1
05/15 04:58:24 INFO: C[01] S: 1242375675 E: 1242375919 T: 32
N: 2
05/15 04:58:24 INFO: C[02] S: 1242375919 E: 1242392150 T: 48
N: 3
05/15 04:58:24 INFO: C[03] S: 1242392150 E: 2139740800 T: 64
N: 4
05/15 04:58:24 INFO: C[00] S: 1242356304 E: 1242375675 T: 16
N: 1
05/15 04:58:24 INFO: C[01] S: 1242375675 E: 1242375919 T: 32
N: 2
05/15 04:58:24 INFO: C[02] S: 1242375919 E: 1242392150 T: 48
N: 3
05/15 04:58:24 INFO: C[03] S: 1242392150 E: 1242599435 T: 64
N: 4
05/15 04:58:24 INFO: C[04] S: 1242599435 E: 2139740800 T: 80
N: 5
05/15 04:58:24 INFO: C[00] S: 1242356304 E: 1242375675 T: 16
N: 1
05/15 04:58:24 INFO: C[01] S: 1242375675 E: 1242375919 T: 32
N: 2
05/15 04:58:24 INFO: C[02] S: 1242375919 E: 1242392150 T: 48
N: 3
05/15 04:58:24 INFO: C[03] S: 1242392150 E: 1242520182 T: 64
N: 4
05/15 04:58:24 INFO: C[04] S: 1242520182 E: 1242599435 T: 80
N: 5
05/15 04:58:24 INFO: C[05] S: 1242599435 E: 2139740800 T: 96
N: 6
05/15 04:58:24 MJobSelectFRL(7529,G,1,RCount)
05/15 04:58:24 INFO: start time 00:00:00 found for job 7529 in
partition huge (1242356304)
05/15 04:58:24 MJobGetRange(7529,RQ,huge,
00:00:00,GRange,MAvlNodeList,NodeMap,8,NULL)
------------------------------------------------------------------------------------
05/15 04:58:24 INFO: node node035 supports 16 tasks of job 7529:0
for INFINITY of 3:00:00:00 (no reservation)
05/15 04:58:24 INFO: located resources for 16 tasks (16) in best
partition huge for job 7529 at time 00:00:00
05/15 04:58:24
MJobAllocMNL(7529,MFeasibleList,NodeMap,MOutList,MINRESOURCE,1242356304)
05/15 04:58:24 INFO: evaluating nodes on alloc iteration 0 for job
7529:0
05/15 04:58:24 INFO: evaluating nodes on alloc iteration 1 for job
7529:0
05/15 04:58:24 INFO: evaluating nodes on alloc iteration 2 for job
7529:0
05/15 04:58:24 INFO: evaluating nodes on alloc iteration 3 for job
7529:0
05/15 04:58:24 INFO: evaluating nodes on alloc iteration 4 for job
7529:0
05/15 04:58:24 INFO: evaluating nodes on alloc iteration 5 for job
7529:0
05/15 04:58:24 INFO: tasks located for job 7529: 16 of 16
required (16 feasible)
05/15 04:58:24 INFO: allocated MNode[000]x16 'node035' to 7529:0
05/15 04:58:24 MJobDistributeTasks(7529,MASTER,NodeList,TaskMap)
05/15 04:58:24 INFO: 1 node(s)/16 task(s) added to 7529:0
05/15 04:58:24 INFO: MNode[000] 'node035'(x16) added to job '7529'
[031] node035: (P:16,S:64297,M:64619,D:1) [Idle][linux]
[[NONE]]<0.000000> C:[normal 16:16][vasp 16:16][test 16:16][check
16:16][batch 16:16][DEFAULT] [
05/15 04:58:24 INFO: end of list reached. 1 nodes found
05/15 04:58:24 INFO: tasks distributed: 16 (Round-Robin)
05/15 04:58:24 MResJCreate(7529,MNodeList,00:00:00,Priority,Res)
05/15 04:58:24 MResAddNode(7529,node035,16,0)
05/15 04:58:24 MResAdjustDRes(7529,FALSE)
05/15 04:58:24 INFO: job '7529' reserved 16 tasks (partition huge)
to start in 00:00:00 on Fri May 15 04:58:24
(WC: 259200)
------------------------------------------------------------------------------------
05/15 04:58:24 MJobPReserve(7529,huge,ResCount,ResCountRej)
05/15 04:58:24 INFO: no priority reservations created (existing
reservation)
------------------------------------------------------------------------------------
05/15 04:58:24 INFO: job 7529 not considered for spanning
------------------------------------------------------------------------------------
I have to admit, that I don't have ideas what is going on. Yesterday
everything it used to work. Today, jobs a blocking cluster. Only one
way to force them to pass is to set backfill depth to 0.
Nothing happened between yesterday and today. No actions, but still
changes, isn't it a miracle? :)
--
Anton Starikov.
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers