Hi all,

Some jobs sit at the top of the idle queue and don't let the rest start
(jobs from other queues that have nothing to do with these ones).

Looking at them, I see they have the resources to start running, but they
don't start:


# checkjob -v 672949


checking job 672949 (RM job '672949.pbs02.pic.es')

State: Idle
Creds:  user:iatprd045  group:iatprd  class:ifae  qos:ilhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Tue Oct  7 06:35:52
  (Time Queued  Total: 3:02:20  Eligible: 1:20:42)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [ifae]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0


IWD: [NONE]  Executable:  [NONE]
Bypass: 12  StartCount: 0
PartitionMask: [ALL]
SystemQueueTime: Tue Oct  7 08:17:30

PE:  1.00  StartPriority:  82
job can run in partition DEFAULT (17 procs available.  1 procs required)


]# diagnose -j 672949
Name    State  Par  Proc  QOS  WCLimit     R  Min  User      Group   Account  QueuedTime  Network  Opsys   Arch    Mem  Disk  Procs  Class     Features

672949  Idle   ALL  1     ilh  3:00:00:00  0  1    iatprd04  iatprd  -        1:22:43     [NONE]   [NONE]  [NONE]  >=0  >=0   NC0    [ifae:1]  [ifae]


There are some nodes where they could start:

td204.pic.es
     state = free
     np = 4
     properties = ifae
--

td203.pic.es
     state = free
     np = 4
     properties = ifae


# checknode td204.pic.es


checking node td204.pic.es

State:   Running  (in current state for 00:00:00)
Expected State:     Idle   SyncDeadline: Sat Oct 24 14:26:40
Configured Resources: PROCS: 4  MEM: 8115M  SWAP: 8115M  DISK: 15G
Utilized   Resources: DISK: 4752M
Dedicated  Resources: PROCS: 3
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:       3.000
Network:    [DEFAULT]
Features:   [ifae]
Attributes: [Batch]
Classes:    [long 4:4][medium 4:4][short 4:4][ifae 1:4][gshort 4:4][glong 4:4][gmedium 4:4][lhcbsl4 4:4][magic 4:4][roman 4:4]

Total Time: 58:11:34:08  Up: 58:10:24:24 (99.92%)  Active: 41:19:36:22 (71.50%)

Reservations:
  Job '672291'(x1)  -6:17:17 -> 2:17:42:43 (3:00:00:00)
  Job '672297'(x1)  -6:15:47 -> 2:17:44:13 (3:00:00:00)
  Job '672924'(x1)  -3:05:22 -> 2:20:54:38 (3:00:00:00)
JobList:  672291,672297,672924
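
To make the arithmetic explicit, here is a minimal Python sketch that computes the free processors on the node as configured minus dedicated. It assumes the checknode field layout stays exactly as pasted above; the helper names `procs` and `free_procs` are hypothetical and not part of Maui.

```python
import re

# Sample text copied from the checknode output above (only the lines used here).
CHECKNODE_TEXT = """\
Configured Resources: PROCS: 4  MEM: 8115M  SWAP: 8115M  DISK: 15G
Utilized   Resources: DISK: 4752M
Dedicated  Resources: PROCS: 3
"""


def procs(text, label):
    """Return the PROCS value on the '<label> Resources:' line, or 0 if absent."""
    pattern = rf"^{label}\s+Resources:.*?PROCS:\s*(\d+)"
    match = re.search(pattern, text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def free_procs(text):
    """Free processors = configured PROCS minus dedicated PROCS."""
    return procs(text, "Configured") - procs(text, "Dedicated")


print(free_procs(CHECKNODE_TEXT))
```

For td204.pic.es this gives 1, which agrees with the 1:4 shown in the Procs column of diagnose -n, so the single processor the job asks for does appear to be free.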


]# diagnose -n td204.pic.es
diagnosing node table (5120 slots)
Name          State    Procs  Memory     Disk         Swap       Speed  Opsys  Arch    Par  Load  Res  Classes                         Network    Features

td204.pic.es  Running  1:4    8115:8115  10635:15387  8115:8115  1.00   linux  [NONE]  DEF  3.00  003  [long_4:4][medium_4:4][short_4  [DEFAULT]  [ifae]
-----         ---      1:4    8115:8115  10635:15387  8115:8115

Total Nodes: 1  (Active: 1  Idle: 0  Down: 0)




If I force them (runjob) they start, but meanwhile I have a long
queue with many jobs in other queues that could also start.

Where may I start looking for the source of this problem?


Cheers,
Arnau
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers