Hi all, we have several idle jobs that do not start even though there are free resources they could run on:
For example:

# checkjob 528813
checking job 528813

State: Idle
Creds:  user:iatprd003  group:iatprd  class:ifae  qos:ilhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Sep 19 10:40:44
  (Time Queued  Total: 2:35:09  Eligible: 00:54:46)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [ifae]

IWD: [NONE]  Executable: [NONE]
Bypass: 76  StartCount: 0
PartitionMask: [ALL]
PE: 1.00  StartPriority: 86
job can run in partition DEFAULT (2 procs available. 1 procs required)

So Maui says it has procs available, but the job does not start. It is the first Idle job, and we think it is blocking the entire queue:

IDLE JOBS----------------------
JOBNAME   USERNAME    STATE  PROC  WCLIMIT     QUEUETIME

528813    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 10:40:44
528814    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 10:40:46
528815    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 10:40:55
528816    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 10:40:56
528817    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 10:41:04
528818    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 10:41:04
529071    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:13:58
529072    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:09
529073    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:09
529074    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:09
529076    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:15
529077    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:15
529078    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:16
529080    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:21
529081    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:22
529082    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:22
529083    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:26
529084    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:27
529085    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:27
529088    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:31
529089    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:33
529090    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:33
529091    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:36
529092    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:38
529093    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:38
529094    iatprd003   Idle   1     3:00:00:00  Fri Sep 19 11:14:40
529858    atlas179    Idle   1     3:00:00:00  Fri Sep 19 13:14:56
529793    atprd020    Idle   1     3:00:00:00  Fri Sep 19 13:05:29
529794    atprd020    Idle   1     3:00:00:00  Fri Sep 19 13:05:30
529795    atprd020    Idle   1     3:00:00:00  Fri Sep 19 13:05:31
529796    atprd020    Idle   1     3:00:00:00  Fri Sep 19 13:05:32

If we check the next job, from a different user:

[EMAIL PROTECTED] sbin]# checkjob 529858
checking job 529858

State: Idle
Creds:  user:atlas179  group:atlas  class:glong  qos:lhcatlas
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Fri Sep 19 13:14:56
  (Time Queued  Total: 00:02:58  Eligible: 00:02:58)

StartDate: -00:02:01  Fri Sep 19 13:15:53
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [slc4]

IWD: [NONE]  Executable: [NONE]
Bypass: 4  StartCount: 0
PartitionMask: [ALL]
PE: 1.00  StartPriority: -12
job can run in partition DEFAULT (283 procs available. 1 procs required)

So it has 283 procs available, but it doesn't start either.
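In case it helps anyone check many jobs at once, here is a minimal Python sketch that flags jobs in this situation from saved checkjob output. It assumes the output looks like the two pastes above (a "State:" line and a "job can run in partition ..." line). The function name idle_but_runnable is made up for illustration and is not part of Maui. It only detects the symptom; it does not explain why the job is held back.

```python
import re

# Summary line printed by checkjob in the pastes above, e.g.
# "job can run in partition DEFAULT (283 procs available. 1 procs required)"
CAN_RUN = re.compile(
    r"job can run in partition (\S+) "
    r"\((\d+) procs available\.\s+(\d+) procs required\)"
)
STATE = re.compile(r"State:\s*(\S+)")


def idle_but_runnable(checkjob_text):
    """Return (partition, available, required) if the job is Idle although
    checkjob reports enough procs for it; otherwise return None.

    Hypothetical helper: the expected format is assumed from the pastes above.
    """
    state = STATE.search(checkjob_text)
    can_run = CAN_RUN.search(checkjob_text)
    if not state or not can_run:
        return None
    if state.group(1) != "Idle":
        return None
    partition = can_run.group(1)
    available = int(can_run.group(2))
    required = int(can_run.group(3))
    if available < required:
        return None
    return partition, available, required


# Lines taken from the checkjob 529858 output above.
SAMPLE = (
    "checking job 529858\n"
    "State: Idle\n"
    "PE: 1.00 StartPriority: -12\n"
    "job can run in partition DEFAULT (283 procs available. 1 procs required)\n"
)

if __name__ == "__main__":
    print(idle_but_runnable(SAMPLE))
```

The text passed in would be whatever checkjob prints for one job, captured to a file or a string however is convenient on your site.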
These are the Maui log entries referring to the first job:

]# grep 528813 /var/log/maui.log.1
09/19 10:42:16 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221813736 processing node request line '1'
09/19 10:44:18 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:46:22 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:48:25 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:50:28 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:50:39 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814238 processing node request line '1'
09/19 10:50:40 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:53:03 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814383 processing node request line '1'
09/19 10:53:04 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 10:54:50 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814489 processing node request line '1'
09/19 10:57:48 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814667 processing node request line '1'
09/19 11:00:05 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221814804 processing node request line '1'
09/19 11:16:26 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221815785 processing node request line '1'
09/19 12:13:17 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:15:18 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:17:19 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:19:20 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:21:21 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:23:22 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:25:23 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:27:24 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:29:25 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:31:26 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:33:27 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:35:28 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:37:50 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:46:23 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821182 processing node request line '1'
09/19 12:46:24 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:48:32 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821312 processing node request line '1'
09/19 12:48:47 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821327 processing node request line '1'
09/19 12:48:48 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:50:58 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:53:01 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:55:04 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:57:07 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:57:59 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821878 processing node request line '1'
09/19 12:57:59 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)
09/19 12:58:07 INFO: job '528813' loaded: 1 iatprd003 iatprd 259200 Idle 0 1221813644 [NONE] [NONE] [NONE] >= 0 >= 0 [ifae] 1221821886 processing node request line '1'
09/19 12:58:07 MJobPReserve(528813,DEFAULT,ResCount,ResCountRej)

# grep " cannot set job" /var/log/maui.log|grep 529813.pbs02.pic.es
09/19 13:19:55 WARNING: cannot set job '529813.pbs02.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001 'Unknown Job Id')

Finally, I am able to force it to start:

]# runjob 528813
job '528813' started on 1 proc

So, could someone give us a hint as to why Maui is not scheduling a job when there are available resources?

TIA,
Arnau

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
