Sorry for the cross-post, all, but I'm hurting in a bad way. I attempted an upgrade to the latest 2.4 beta snapshot to check out the new features, and ever since then things have gone downhill fast. The cluster began scheduling to only half of the available compute resources, jobs were getting cancelled, and there was all sorts of other odd behaviour.

Out of desperation I reinstalled the whole cluster, reverting to the latest released stable snapshots, yet I'm still seeing odd behaviour. For example, jobs are being "lost": they appear in showq as idle, then immediately go blocked, and if I try to releasehold them I'm told the job doesn't exist.

As it sits right now, this is a freshly installed cluster running the latest stable torque and maui. Some jobs are running normally; others are not.

Any help would be greatly appreciated.


Details below.

3963MB Nodes
            NodeName Available      Busy NodeState
              sdats0    96.30%    58.27%      Idle
              sdats1    94.82%    70.07%      Idle
Summary: 2 3963MB Nodes 95.56% Avail 61.28% Busy (Current: 100.00% Avail 0.00% Busy)

4565MB Nodes
            NodeName Available      Busy NodeState
             a05-nll    95.78%    59.02%   Running
             a06-nll    95.51%    59.19%   Running
             a07-nll    95.51%    60.19%   Running
             a08-nll    95.14%    57.64%      Idle
Summary: 4 4565MB Nodes 95.48% Avail 56.35% Busy (Current: 100.00% Avail 0.00% Busy)

8120MB Nodes
            NodeName Available      Busy NodeState
             a02-nll    94.36%    59.91%   Running
Summary: 1 8120MB Nodes 94.36% Avail 56.53% Busy (Current: 100.00% Avail 0.00% Busy)

16047MB Nodes
            NodeName Available      Busy NodeState
             ilhpc01    97.36%    49.23%      Idle
             ilhpc02    97.98%    48.86%      Idle
            linear-a    97.98%    71.86%      Idle
            linear-b    96.26%    77.95%      Idle
Summary: 4 16047MB Nodes 97.39% Avail 60.31% Busy (Current: 100.00% Avail 0.00% Busy)

30775MB Nodes
            NodeName Available      Busy NodeState
               prism    95.16%    57.72%   Running
Summary: 1 30775MB Nodes 95.16% Avail 54.92% Busy (Current: 100.00% Avail 0.00% Busy)

32169MB Nodes
            NodeName Available      Busy NodeState
            r1s40c01    97.98%    46.65%      Idle
            r1s39c02    97.98%    46.44%      Idle
            r1s38c03    97.36%    46.69%      Idle
            r1s37c04    96.91%    48.08%      Idle
            r1s36c05    96.60%    50.12%      Idle
            r1s35c06    96.91%    49.09%      Idle
            r1s34c07    96.60%    43.57%      Idle
            r1s33c08    96.91%    48.87%      Idle
            r1s31c09    96.60%    58.53%      Idle
Summary: 9 32169MB Nodes 97.09% Avail 47.25% Busy (Current: 100.00% Avail 0.00% Busy)

32189MB Nodes
            NodeName Available      Busy NodeState
             rosetta    95.51%    62.02%      Busy
Summary: 1 32189MB Nodes 95.51% Avail 59.23% Busy (Current: 100.00% Avail 100.00% Busy)

61606MB Nodes
            NodeName Available      Busy NodeState
              icarus    99.57%    56.49%   Running
Summary: 1 61606MB Nodes 99.57% Avail 56.25% Busy (Current: 100.00% Avail 0.00% Busy)

128997MB Nodes
            NodeName Available      Busy NodeState
            r1s26c10    96.30%    75.33%   Running
            r1s22c11    95.51%    58.83%      Idle
            r1s18c12    96.60%    57.63%      Idle
            r1s14c13    96.30%    59.44%      Idle
            r1s10c14    96.30%    45.73%      Idle
Summary: 5 128997MB Nodes 96.20% Avail 57.13% Busy (Current: 100.00% Avail 0.00% Busy)

System Summary: 30 Nodes 90.09% Avail 50.90% Busy (Current: 93.33% Avail 3.33% Busy)


Idle Jobs

JobName  Priority  XFactor  Q  User  Group     Procs  WCLimit     Class  SystemQueueTime

2901*    20058     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:11:47
2924     19966     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:16:22
2925     19966     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:16:22
2921     19916     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:18:52
2922     19916     1.1      -  user  students  1      7:12:00:00  batch  Sun Mar 1 19:18:52

Jobs: 5  Total Backlog:  900.00 ProcHours  (3.17 Hours)




BLOCKED JOBS----------------
JOBNAME  USERNAME  STATE  PROC  WCLIMIT      QUEUETIME

3692     user2     Hold   1     00:00:00     Mon Mar 2 10:52:55
3693     user2     Hold   1     00:00:00     Mon Mar 2 10:52:55
3694     user2     Hold   1     00:00:00     Mon Mar 2 10:52:55
3695     user2     Hold   1     00:00:00     Mon Mar 2 10:52:56
3696     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3697     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3698     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3699     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3700     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56
3701     user2     Hold   1     99:23:59:59  Mon Mar 2 10:52:56

Total Jobs: 62   Active Jobs: 47   Idle Jobs: 5   Blocked Jobs: 10



# checkjob 2901


checking job 2901

State: Idle
Creds:  user:user  group:students  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 7:12:00:00
SubmitTime: Sun Mar  1 19:00:35
  (Time Queued  Total: 17:53:13  Eligible: 17:42:01)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 647  StartCount: 3
PartitionMask: [ALL]
Flags:       HOSTLIST RESTARTABLE
HostList:
  [a07-nll:1]
Reservation '2901' (2:08:40:19 -> 9:20:40:19  Duration: 7:12:00:00)
Messages: cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable REJHOST=a07-nll MSG=cannot allocate node 'a07-nll' to job - node not currently available (nps needed/free: 1/0, joblist: 2539.queen:0,2540.queen:1,2899.queen:2,2542.queen:3)'
PE:  1.00  StartPriority:  20159
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 276  feasible procs:   0

Rejection Reasons: [CPU          :    1][HostList     :   29]


# checkjob 3692


checking job 3692

State: Hold
Creds:  user:user2  group:othergroup  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 00:00:00
SubmitTime: Mon Mar  2 10:52:55
  (Time Queued  Total: 2:02:40  Eligible: 00:00:00)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: x86_64  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 400M
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

WARNING:  job has not been detected in 2:02:39
PE:  1.00  StartPriority:  2787
cannot select job 3692 for partition DEFAULT (non-idle state 'Hold')


--
James A. Peltier
Systems Analyst (FASNet), VIVARIUM Technical Director
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573
Fax     : 778-782-3045
E-Mail  : [email protected]
Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca
           http://blogs.sfu.ca/people/jpeltier
MSN     : [email protected]

Your mouse has moved.  Windows has detected hardware
changes that require a reboot. Click OK to reboot.