Hi all,
Some jobs stay at the top of the idle queue without ever starting, and
they block the rest from starting (including jobs from other queues that
have nothing to do with them).
Looking at them, I see the resources they need are available, but they
do not start:
# checkjob -v 672949
checking job 672949 (RM job '672949.pbs02.pic.es')
State: Idle
Creds: user:iatprd045 group:iatprd class:ifae qos:ilhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Tue Oct 7 06:35:52
(Time Queued Total: 3:02:20 Eligible: 1:20:42)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [ifae]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
NodeAccess: SHARED
NodeCount: 0
IWD: [NONE] Executable: [NONE]
Bypass: 12 StartCount: 0
PartitionMask: [ALL]
SystemQueueTime: Tue Oct 7 08:17:30
PE: 1.00 StartPriority: 82
job can run in partition DEFAULT (17 procs available. 1 procs required)
# diagnose -j 672949
Name State Par Proc QOS WCLimit R Min User Group
Account QueuedTime Network Opsys Arch Mem Disk Procs Class
Features
672949 Idle ALL 1 ilh 3:00:00:00 0 1 iatprd04 iatprd
- 1:22:43 [NONE] [NONE] [NONE] >=0 >=0 NC0 [ifae:1]
[ifae]
There are some nodes where they could start:
td204.pic.es
state = free
np = 4
properties = ifae
--
td203.pic.es
state = free
np = 4
properties = ifae
# checknode td204.pic.es
checking node td204.pic.es
State: Running (in current state for 00:00:00)
Expected State: Idle SyncDeadline: Sat Oct 24 14:26:40
Configured Resources: PROCS: 4 MEM: 8115M SWAP: 8115M DISK: 15G
Utilized Resources: DISK: 4752M
Dedicated Resources: PROCS: 3
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 3.000
Network: [DEFAULT]
Features: [ifae]
Attributes: [Batch]
Classes: [long 4:4][medium 4:4][short 4:4][ifae 1:4][gshort 4:4][glong
4:4][gmedium 4:4][lhcbsl4 4:4][magic 4:4][roman 4:4]
Total Time: 58:11:34:08 Up: 58:10:24:24 (99.92%) Active: 41:19:36:22 (71.50%)
Reservations:
Job '672291'(x1) -6:17:17 -> 2:17:42:43 (3:00:00:00)
Job '672297'(x1) -6:15:47 -> 2:17:44:13 (3:00:00:00)
Job '672924'(x1) -3:05:22 -> 2:20:54:38 (3:00:00:00)
JobList: 672291,672297,672924
# diagnose -n td204.pic.es
diagnosing node table (5120 slots)
Name State Procs Memory Disk Swap
Speed Opsys Arch Par Load Res Classes Network
Features
td204.pic.es Running 1:4 8115:8115 10635:15387 8115:8115
1.00 linux [NONE] DEF 3.00 003 [long_4:4][medium_4:4][short_4 [DEFAULT]
[ifae]
----- --- 1:4 8115:8115 10635:15387 8115:8115
Total Nodes: 1 (Active: 1 Idle: 0 Down: 0)
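As a sanity check on the numbers above, here is a minimal Python sketch of how the free processor count on td204.pic.es can be read off the checknode output (configured minus dedicated). The helper names `procs` and `free_procs` are hypothetical, made up for illustration; the input text is copied from the output quoted above. It is not how Maui itself computes availability.

```python
import re

# Excerpt of the checknode output quoted above (td204.pic.es).
CHECKNODE_OUTPUT = """\
Configured Resources: PROCS: 4 MEM: 8115M SWAP: 8115M DISK: 15G
Utilized Resources: DISK: 4752M
Dedicated Resources: PROCS: 3
"""


def procs(label, text):
    """Return the PROCS value on the line starting with `label`, or 0 if absent."""
    pattern = rf"^{re.escape(label)}:.*?PROCS:\s*(\d+)"
    match = re.search(pattern, text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def free_procs(text):
    """Processors not yet dedicated to jobs: configured minus dedicated."""
    return procs("Configured Resources", text) - procs("Dedicated Resources", text)


print(free_procs(CHECKNODE_OUTPUT))
```

The result agrees with the `1:4` in the Procs column of diagnose -n and with the `[ifae 1:4]` class entry, so one slot on this node does look free for a 1-proc ifae job.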
If I force them (runjob) they start, but meanwhile I have a long
queue with many jobs in other queues that could also start.
Where should I start looking for the source of this problem?
Cheers,
Arnau