Sorry for the cross-post, all, but I'm hurting in a bad way. I attempted an
upgrade to the latest 2.4 beta snapshot to check out the new features, and
ever since then things have gone downhill fast. The cluster began scheduling
to only half of the available compute resources, jobs were getting cancelled,
and there was all sorts of other odd behaviour.

Out of desperation I reinstalled the whole cluster, reverting to the latest
released stable snapshots, yet I'm still seeing odd behaviour. For example,
jobs are being "lost": they appear in showq as idle, then immediately go
blocked, and if I then try to release them with releasehold, Maui says the
job doesn't exist.

As it sits right now, it is a freshly installed cluster running the latest
stable Torque and Maui. Some jobs are running normally, others are not.
Any help would be greatly appreciated.
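To make the "lost job" symptom concrete, this is the command sequence that
triggers it (the job ID here is just an example; the real IDs are in the
showq output below):

```shell
# The job shows up as Idle in the queue listing...
showq

# ...then moments later it appears under BLOCKED JOBS in the Hold state.
# Trying to release the hold fails:
releasehold 3692     # Maui replies that the job does not exist

# Yet asking Maui about the same job directly still returns a record:
checkjob 3692
```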
Details below.
3963MB Nodes
NodeName   Available  Busy    NodeState
sdats0     96.30%     58.27%  Idle
sdats1     94.82%     70.07%  Idle
Summary: 2 3963MB Nodes 95.56% Avail 61.28% Busy (Current: 100.00% Avail 0.00% Busy)

4565MB Nodes
NodeName   Available  Busy    NodeState
a05-nll    95.78%     59.02%  Running
a06-nll    95.51%     59.19%  Running
a07-nll    95.51%     60.19%  Running
a08-nll    95.14%     57.64%  Idle
Summary: 4 4565MB Nodes 95.48% Avail 56.35% Busy (Current: 100.00% Avail 0.00% Busy)

8120MB Nodes
NodeName   Available  Busy    NodeState
a02-nll    94.36%     59.91%  Running
Summary: 1 8120MB Nodes 94.36% Avail 56.53% Busy (Current: 100.00% Avail 0.00% Busy)

16047MB Nodes
NodeName   Available  Busy    NodeState
ilhpc01    97.36%     49.23%  Idle
ilhpc02    97.98%     48.86%  Idle
linear-a   97.98%     71.86%  Idle
linear-b   96.26%     77.95%  Idle
Summary: 4 16047MB Nodes 97.39% Avail 60.31% Busy (Current: 100.00% Avail 0.00% Busy)

30775MB Nodes
NodeName   Available  Busy    NodeState
prism      95.16%     57.72%  Running
Summary: 1 30775MB Nodes 95.16% Avail 54.92% Busy (Current: 100.00% Avail 0.00% Busy)

32169MB Nodes
NodeName   Available  Busy    NodeState
r1s40c01   97.98%     46.65%  Idle
r1s39c02   97.98%     46.44%  Idle
r1s38c03   97.36%     46.69%  Idle
r1s37c04   96.91%     48.08%  Idle
r1s36c05   96.60%     50.12%  Idle
r1s35c06   96.91%     49.09%  Idle
r1s34c07   96.60%     43.57%  Idle
r1s33c08   96.91%     48.87%  Idle
r1s31c09   96.60%     58.53%  Idle
Summary: 9 32169MB Nodes 97.09% Avail 47.25% Busy (Current: 100.00% Avail 0.00% Busy)

32189MB Nodes
NodeName   Available  Busy    NodeState
rosetta    95.51%     62.02%  Busy
Summary: 1 32189MB Nodes 95.51% Avail 59.23% Busy (Current: 100.00% Avail 100.00% Busy)

61606MB Nodes
NodeName   Available  Busy    NodeState
icarus     99.57%     56.49%  Running
Summary: 1 61606MB Nodes 99.57% Avail 56.25% Busy (Current: 100.00% Avail 0.00% Busy)

128997MB Nodes
NodeName   Available  Busy    NodeState
r1s26c10   96.30%     75.33%  Running
r1s22c11   95.51%     58.83%  Idle
r1s18c12   96.60%     57.63%  Idle
r1s14c13   96.30%     59.44%  Idle
r1s10c14   96.30%     45.73%  Idle
Summary: 5 128997MB Nodes 96.20% Avail 57.13% Busy (Current: 100.00% Avail 0.00% Busy)

System Summary: 30 Nodes 90.09% Avail 50.90% Busy (Current: 93.33% Avail 3.33% Busy)
Idle Jobs
JobName  Priority  XFactor  Q  User  Group     Procs  WCLimit     Class  SystemQueueTime
2901*    20058     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:11:47
2924     19966     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:16:22
2925     19966     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:16:22
2921     19916     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:18:52
2922     19916     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:18:52

Jobs: 5  Total Backlog: 900.00 ProcHours (3.17 Hours)
BLOCKED JOBS----------------
JOBNAME  USERNAME  STATE  PROC  WCLIMIT      QUEUETIME
3692     user2     Hold   1     00:00:00     Mon Mar 2 10:52:55
3693     user2     Hold   1     00:00:00     Mon Mar 2 10:52:55
3694     user2     Hold   1     00:00:00     Mon Mar 2 10:52:55
3695     user2     Hold   1     00:00:00     Mon Mar 2 10:52:56
3696     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3697     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3698     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3699     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3700     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3701     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56

Total Jobs: 62  Active Jobs: 47  Idle Jobs: 5  Blocked Jobs: 10
# checkjob 2901
checking job 2901
State: Idle
Creds: user:user group:students class:batch qos:DEFAULT
WallTime: 00:00:00 of 7:12:00:00
SubmitTime: Sun Mar 1 19:00:35
(Time Queued Total: 17:53:13 Eligible: 17:42:01)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 2048M
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 647 StartCount: 3
PartitionMask: [ALL]
Flags: HOSTLIST RESTARTABLE
HostList:
[a07-nll:1]
Reservation '2901' (2:08:40:19 -> 9:20:40:19  Duration: 7:12:00:00)
Messages:  cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable REJHOST=a07-nll MSG=cannot allocate node 'a07-nll' to job - node not currently available (nps needed/free: 1/0, joblist: 2539.queen:0,2540.queen:1,2899.queen:2,2542.queen:3)'
PE:  1.00  StartPriority:  20159
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 276  feasible procs: 0
Rejection Reasons: [CPU : 1][HostList : 29]
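The rc 15044 rejection above reads like pbs_mom on a07-nll still believes
its slots are occupied by jobs 2539, 2540, 2899 and 2542, even though Maui
wants to place 2901 there. These are the standard Torque commands I can use
to compare the MOM's view of the node against the server's (node and job
IDs taken from the message above):

```shell
# Ask the pbs_mom daemon on the node what jobs it thinks it is running
momctl -d 3 -h a07-nll

# Compare against the pbs_server view of the same node
pbsnodes a07-nll

# And check whether the jobs holding the slots still exist server-side
qstat -f 2539 2540 2899 2542
```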
# checkjob 3692
checking job 3692
State: Hold
Creds: user:user2 group:othergroup class:batch qos:DEFAULT
WallTime: 00:00:00 of 00:00:00
SubmitTime: Mon Mar 2 10:52:55
(Time Queued Total: 2:02:40 Eligible: 00:00:00)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: x86_64 Features: [NONE]
Dedicated Resources Per Task: PROCS: 1 MEM: 400M
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
WARNING: job has not been detected in 2:02:39
PE: 1.00 StartPriority: 2787
cannot select job 3692 for partition DEFAULT (non-idle state 'Hold')
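The "job has not been detected" warning looks like Maui holding a record
for 3692 that the resource manager no longer reports. A way to check both
sides (standard Torque/Maui commands, job ID from above):

```shell
# Does Torque still know about the job at all?
qstat -f 3692

# What does Maui's own diagnostic say about its record of the job?
diagnose -j 3692
```

If qstat says the job is unknown while diagnose still shows it, the Maui
record would seem to be stale rather than the job genuinely held.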
--
James A. Peltier
Systems Analyst (FASNet), VIVARIUM Technical Director
Simon Fraser University - Burnaby Campus
Phone : 778-782-6573
Fax : 778-782-3045
E-Mail : [email protected]
Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca
http://blogs.sfu.ca/people/jpeltier
MSN : [email protected]
Your mouse has moved. Windows has detected hardware
changes that require a reboot. Click OK to reboot.
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers