[Mauiusers] hurting in a bad way

James A. Peltier Mon, 02 Mar 2009 12:59:33 -0800

Sorry for the cross-post, all, but I'm hurting in a bad way. I attempted an upgrade to the latest 2.4 beta snapshot to check out the new features, and ever since then things have gone downhill fast. The cluster began scheduling to only half of the available compute resources. Jobs were getting cancelled, along with all sorts of other odd behaviour.

Out of desperation I reinstalled the whole cluster, reverting to the latest released stable snapshots, yet I'm still seeing odd behaviour. For example, jobs are being "lost": they appear in showq as idle, then immediately go blocked, and if I try to releasehold them I'm told the job doesn't exist.

As it sits right now, it is a freshly installed cluster running the latest stable torque and maui. Some jobs are running normally; others are not.


Any help would be greatly appreciated.


details below

3963MB Nodes
            NodeName Available      Busy NodeState
              sdats0    96.30%    58.27%      Idle
              sdats1    94.82%    70.07%      Idle

Summary: 2 3963MB Nodes 95.56% Avail 61.28% Busy (Current: 100.00% Avail 0.00% Busy)


4565MB Nodes
            NodeName Available      Busy NodeState
             a05-nll    95.78%    59.02%   Running
             a06-nll    95.51%    59.19%   Running
             a07-nll    95.51%    60.19%   Running
             a08-nll    95.14%    57.64%      Idle

Summary: 4 4565MB Nodes 95.48% Avail 56.35% Busy (Current: 100.00% Avail 0.00% Busy)


8120MB Nodes
            NodeName Available      Busy NodeState
             a02-nll    94.36%    59.91%   Running

Summary: 1 8120MB Nodes 94.36% Avail 56.53% Busy (Current: 100.00% Avail 0.00% Busy)


16047MB Nodes
            NodeName Available      Busy NodeState
             ilhpc01    97.36%    49.23%      Idle
             ilhpc02    97.98%    48.86%      Idle
            linear-a    97.98%    71.86%      Idle
            linear-b    96.26%    77.95%      Idle

Summary: 4 16047MB Nodes 97.39% Avail 60.31% Busy (Current: 100.00% Avail 0.00% Busy)


30775MB Nodes
            NodeName Available      Busy NodeState
               prism    95.16%    57.72%   Running

Summary: 1 30775MB Nodes 95.16% Avail 54.92% Busy (Current: 100.00% Avail 0.00% Busy)


32169MB Nodes
            NodeName Available      Busy NodeState
            r1s40c01    97.98%    46.65%      Idle
            r1s39c02    97.98%    46.44%      Idle
            r1s38c03    97.36%    46.69%      Idle
            r1s37c04    96.91%    48.08%      Idle
            r1s36c05    96.60%    50.12%      Idle
            r1s35c06    96.91%    49.09%      Idle
            r1s34c07    96.60%    43.57%      Idle
            r1s33c08    96.91%    48.87%      Idle
            r1s31c09    96.60%    58.53%      Idle

Summary: 9 32169MB Nodes 97.09% Avail 47.25% Busy (Current: 100.00% Avail 0.00% Busy)


32189MB Nodes
            NodeName Available      Busy NodeState
             rosetta    95.51%    62.02%      Busy

Summary: 1 32189MB Nodes 95.51% Avail 59.23% Busy (Current: 100.00% Avail 100.00% Busy)


61606MB Nodes
            NodeName Available      Busy NodeState
              icarus    99.57%    56.49%   Running

Summary: 1 61606MB Nodes 99.57% Avail 56.25% Busy (Current: 100.00% Avail 0.00% Busy)


128997MB Nodes
            NodeName Available      Busy NodeState
            r1s26c10    96.30%    75.33%   Running
            r1s22c11    95.51%    58.83%      Idle
            r1s18c12    96.60%    57.63%      Idle
            r1s14c13    96.30%    59.44%      Idle
            r1s10c14    96.30%    45.73%      Idle

Summary: 5 128997MB Nodes 96.20% Avail 57.13% Busy (Current: 100.00% Avail 0.00% Busy)

System Summary: 30 Nodes 90.09% Avail 50.90% Busy (Current: 93.33% Avail 3.33% Busy)



Idle Jobs

JobName  Priority  XFactor  Q  User  Group     Procs  WCLimit     Class  SystemQueueTime

2901*       20058      1.1  -  user  students      1  7:12:00:00  batch  Sun Mar  1 19:11:47
2924        19966      1.1  -  user  students      1  7:12:00:00  batch  Sun Mar  1 19:16:22
2925        19966      1.1  -  user  students      1  7:12:00:00  batch  Sun Mar  1 19:16:22
2921        19916      1.1  -  user  students      1  7:12:00:00  batch  Sun Mar  1 19:18:52
2922        19916      1.1  -  user  students      1  7:12:00:00  batch  Sun Mar  1 19:18:52


Jobs: 5  Total Backlog:  900.00 ProcHours  (3.17 Hours)
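
As a sanity check on the backlog figure above, here is a minimal sketch. It assumes the backlog is simply processors times wall-clock limit, summed over the idle jobs, and that each of the five idle jobs asks for 1 processor and 7:12:00:00 (the values checkjob reports for job 2901). The helper name `wclimit_to_hours` is made up for illustration and is not part of Maui.

```python
def wclimit_to_hours(wclimit: str) -> float:
    """Convert a [DD:]HH:MM:SS limit string to hours."""
    parts = [int(p) for p in wclimit.split(":")]
    # Left-pad with zeros so the list is always [days, hours, minutes, seconds].
    while len(parts) < 4:
        parts.insert(0, 0)
    days, hours, minutes, seconds = parts
    return days * 24 + hours + minutes / 60 + seconds / 3600


# (job id, processors, wall-clock limit) for the five idle jobs.
idle_jobs = [
    ("2901", 1, "7:12:00:00"),
    ("2924", 1, "7:12:00:00"),
    ("2925", 1, "7:12:00:00"),
    ("2921", 1, "7:12:00:00"),
    ("2922", 1, "7:12:00:00"),
]

backlog = sum(procs * wclimit_to_hours(limit) for _, procs, limit in idle_jobs)
print(f"{backlog:.2f} ProcHours")
```

Under those assumptions the total matches the 900.00 ProcHours that showq reports, so the idle queue itself looks internally consistent.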




BLOCKED JOBS----------------

JOBNAME  USERNAME  STATE  PROC  WCLIMIT      QUEUETIME

3692     user2     Hold      1  00:00:00     Mon Mar  2 10:52:55
3693     user2     Hold      1  00:00:00     Mon Mar  2 10:52:55
3694     user2     Hold      1  00:00:00     Mon Mar  2 10:52:55
3695     user2     Hold      1  00:00:00     Mon Mar  2 10:52:56
3696     user2     Hold      1  99:23:59:59  Mon Mar  2 10:52:56
3697     user2     Hold      1  99:23:59:59  Mon Mar  2 10:52:56
3698     user2     Hold      1  99:23:59:59  Mon Mar  2 10:52:56
3699     user2     Hold      1  99:23:59:59  Mon Mar  2 10:52:56
3700     user2     Hold      1  99:23:59:59  Mon Mar  2 10:52:56
3701     user2     Hold      1  99:23:59:59  Mon Mar  2 10:52:56


Total Jobs: 62   Active Jobs: 47   Idle Jobs: 5   Blocked Jobs: 10



# checkjob 2901


checking job 2901

State: Idle
Creds:  user:user  group:students  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 7:12:00:00
SubmitTime: Sun Mar  1 19:00:35
  (Time Queued  Total: 17:53:13  Eligible: 17:42:01)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 2048M
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 647  StartCount: 3
PartitionMask: [ALL]
Flags:       HOSTLIST RESTARTABLE
HostList:
  [a07-nll:1]
Reservation '2901' (2:08:40:19 -> 9:20:40:19  Duration: 7:12:00:00)

Messages: cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily unavailable REJHOST=a07-nll MSG=cannot allocate node 'a07-nll' to job - node not currently available (nps needed/free: 1/0, joblist:2539.queen:0,2540.queen:1,2899.queen:2,2542.queen:3)'

PE:  1.00  StartPriority:  20159

job cannot run in partition DEFAULT (idle procs do not meet requirements :0 of 1 procs found)

idle procs: 276  feasible procs:   0

Rejection Reasons: [CPU          :    1][HostList     :   29]
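
For anyone who wants to pick apart the RM failure message in the checkjob output above, here is a minimal sketch that pulls out the rejected host, the processors needed and free, and the jobs listed as occupying the node. The regular expressions are my own guess at the message layout based on this single sample, and reading the trailing number in each joblist entry as a processor slot index is an assumption.

```python
import re

# The RM failure text from checkjob 2901, with the line-wrap damage repaired.
msg = (
    "cannot start job - RM failure, rc: 15044, msg: 'Resource temporarily "
    "unavailable REJHOST=a07-nll MSG=cannot allocate node 'a07-nll' to job - "
    "node not currently available (nps needed/free: 1/0, "
    "joblist:2539.queen:0,2540.queen:1,2899.queen:2,2542.queen:3)'"
)

# Host the resource manager refused to allocate.
host = re.search(r"REJHOST=(\S+)", msg).group(1)

# Processors the job needs versus processors free on that host.
needed, free = map(int, re.search(r"nps needed/free:\s*(\d+)/(\d+)", msg).groups())

# Entries look like "<jobid>.<server>:<n>"; <n> is assumed to be a slot index.
joblist = re.search(r"joblist:\s*([^)]*)\)", msg).group(1)
occupants = [entry.rsplit(":", 1) for entry in joblist.split(",")]

print(host, needed, free)
for job, slot in occupants:
    print(job, slot)
```

On this sample it reports a07-nll with 1 processor needed and 0 free, with four jobs listed on the node. Job 2901 is pinned to that host by its HostList, which fits the 0 feasible procs reported even though 276 procs are idle cluster-wide.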


# checkjob 3692


checking job 3692

State: Hold
Creds:  user:user2  group:othergroup  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 00:00:00
SubmitTime: Mon Mar  2 10:52:55
  (Time Queued  Total: 2:02:40  Eligible: 00:00:00)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: x86_64  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 400M
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

WARNING:  job has not been detected in 2:02:39
PE:  1.00  StartPriority:  2787
cannot select job 3692 for partition DEFAULT (non-idle state 'Hold')


--
James A. Peltier
Systems Analyst (FASNet), VIVARIUM Technical Director
Simon Fraser University - Burnaby Campus
Phone   : 778-782-6573
Fax     : 778-782-3045
E-Mail  : [email protected]
Website : http://www.fas.sfu.ca | http://vivarium.cs.sfu.ca
           http://blogs.sfu.ca/people/jpeltier
MSN     : [email protected]

Your mouse has moved.  Windows has detected hardware
changes that require a reboot. Click OK to reboot.
