Dear David,
I'm sorry to bother you again with this issue, but the problem still persists.
Please have a look at this example:
- I submitted a job like this:
qsub -q fwd -l nodes=1:ppn=4 -I -l walltime=12:00:00
- maui.log tells me that the job cannot be started:
03/21 14:30:11 MRMJobStart(784238,Msg,SC)
03/21 14:30:11 MPBSJobStart(784238,base,Msg,SC)
03/21 14:30:11 ERROR: job '784238' cannot be started: (rc: 15046 errmsg:
'Resource temporarily unavailable MSG=job allocation request exceeds currently
available cluster nodes, 1 requested, 0 available' hostlist: 'fluid001:ppn=4')
03/21 14:30:11 ERROR: cannot start job '784238' in partition DEFAULT
03/21 14:30:11 MJobPReserve(784238,DEFAULT,ResCount,ResCountRej)
03/21 14:30:30 job '784238' State: Idle EState: Idle
QueueTime: Tue Mar 21 14:29:50
- checknode shows that this particular node has 16 CPU cores and that Maui
believes 9 of them are in use:
checking node fluid001
State: Running (in current state for 00:00:00)
Expected State: Idle SyncDeadline: Sat Oct 24 14:26:40
Configured Resources: PROCS: 16 MEM: 62G SWAP: 62G DISK: 1M
Utilized Resources: SWAP: 10G
Dedicated Resources: PROCS: 9
Opsys: ubuntu Arch: x64
Speed: 1.00 Load: 15.030
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [default 16:16][fwd 7:16][fwi 16:16][short 16:16][long
16:16][benchmark 16:16][fwo 16:16]
Total Time: INFINITY Up: INFINITY (98.92%) Active: INFINITY (93.87%)
Reservations:
Job '772551'(x1) -6:05:29:39 -> 2:02:30:21 (8:08:00:00)
Job '772553'(x1) -6:05:29:39 -> 2:02:30:21 (8:08:00:00)
Job '772555'(x1) -6:05:29:39 -> 2:02:30:21 (8:08:00:00)
Job '772557'(x1) -6:05:29:39 -> 2:02:30:21 (8:08:00:00)
Job '779684'(x1) -2:20:22:38 -> 5:11:37:22 (8:08:00:00)
Job '779685'(x1) -2:20:22:38 -> 5:11:37:22 (8:08:00:00)
Job '781758'(x1) -1:19:54:49 -> 6:12:05:11 (8:08:00:00)
Job '783132'(x1) -1:00:19:39 -> 7:07:40:21 (8:08:00:00)
Job '783909'(x1) -6:19:42 -> 8:01:40:18 (8:08:00:00)
User 'fluid.0.0'(x1) -00:03:52 -> INFINITY ( INFINITY)
Blocked Resources@00:00:00 Procs: 7/16 (43.75%)
Blocked Resources@2:02:30:21 Procs: 11/16 (68.75%)
Blocked Resources@5:11:37:22 Procs: 13/16 (81.25%)
Blocked Resources@6:12:05:11 Procs: 14/16 (87.50%)
Blocked Resources@7:07:40:21 Procs: 15/16 (93.75%)
Blocked Resources@8:01:40:18 Procs: 16/16 (100.00%)
JobList: 772551,772553,772555,772557,779684,779685,781758,783132,783909
- with qstat I can see that only one slot on the node is free and 15 are
used by the running jobs:
qstat -ae -n | grep fluid001
fluid001/0
fluid001/9
fluid001/11
fluid001/13
fluid001/5,7
fluid001/14-15
fluid001/1
fluid001/2-4,6
fluid001/8,10
- The node has 9 running jobs, but Maui still misparses the slot-allocation
syntax (core ranges such as '2-4,6') that Torque reports.
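To illustrate the mismatch, here is a minimal sketch (not the Maui source) of how the newer Torque slot specs from the qstat output above expand into individual cores. A scheduler that expects exactly one core index per entry will misread ranges like '2-4,6', which matches the miscount seen in checknode:

```python
# Sketch: expand Torque >= 5 exec_host slot specs like "fluid001/2-4,6"
# into individual core indices. Host names and specs below are taken
# from the qstat output in this mail.

def expand_slots(spec):
    """Return (host, [core indices]) for a spec like 'fluid001/2-4,6'."""
    host, _, slots = spec.partition("/")
    cores = []
    for part in slots.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.extend(range(int(lo), int(hi) + 1))
        else:
            cores.append(int(part))
    return host, cores

# The nine allocations reported by qstat for fluid001:
specs = ["fluid001/0", "fluid001/9", "fluid001/11", "fluid001/13",
         "fluid001/5,7", "fluid001/14-15", "fluid001/1",
         "fluid001/2-4,6", "fluid001/8,10"]
used = sorted(c for s in specs for c in expand_slots(s)[1])
print(len(used), used)  # 15 cores in use; only core 12 is free
```

Counted this way, the node really has 15 of 16 cores busy, whereas a parser that treats each entry as a single index arrives at 9 (one per job), which is exactly the "Dedicated Resources: PROCS: 9" that checknode reports.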
Do I have to switch to a newer version of Torque? Currently I am using version
5.1.1.
Thanks in advance,
Henrik
> On 22.08.2016 at 18:21, David Beer <[email protected]> wrote:
>
> This incompatibility exists for all versions of Torque > 5. It has been fixed
> in the Maui source, but no official release has been made. You can grab the
> new source from svn:
>
> svn co svn://opensvn.adaptivecomputing.com/maui
>
> After that you can build it as you would a normal tarball.
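For anyone following along, the "normal tarball" build David mentions usually looks like the sketch below; the --prefix and --with-pbs paths are assumptions and must be adjusted to your installation:

```shell
# Sketch of a typical Maui build from the svn checkout above.
# --prefix and --with-pbs are assumed paths, not prescribed ones.
svn co svn://opensvn.adaptivecomputing.com/maui maui
cd maui
./configure --prefix=/usr/local/maui --with-pbs=/usr/local
make
make install
```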
>
> On Sat, Aug 20, 2016 at 3:59 AM, Guangping Zhang <[email protected]> wrote:
> Dear all,
>
> I found that Torque 6.0.2 does not work properly with Maui 3.3.1 from time to time.
>
> And I found the following in the Maui log file:
>
> 08/20 17:14:12 INFO: PBS node node04 set to state Idle (free)
> 08/20 17:14:12 INFO: node 'node04' changed states from Running to Idle
> 08/20 17:14:12 MPBSNodeUpdate(node04,node04,Idle,NODE00)
> 08/20 17:14:12 INFO: node node04 has joblist '0-9/248.node00'
> 08/20 17:14:12 ALERT: cannot locate PBS job '0-9' (running on node node04)
>
> where '0-9' is not a job ID but the procs allocated to job 248.node00. So,
> will this prevent Torque from working properly together with Maui?
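The joblist entry Guangping quotes can be read with a small sketch (again, not the Maui source, just an illustration of the newer "<slots>/<jobid>" form): the slot list comes before the slash and the job ID after it, so an older parser that takes the first field as the job name ends up looking for a job called '0-9', which is exactly the ALERT in the log above.

```python
# Sketch: parse a newer-style Torque joblist entry such as "0-9/248.node00".
# The slot range precedes the slash; the job ID follows it.

def parse_joblist_entry(entry):
    """Return (jobid, [core indices]) for an entry like '0-9/248.node00'."""
    slots, _, jobid = entry.partition("/")
    cores = []
    for part in slots.split(","):
        lo, _, hi = part.partition("-")
        cores.extend(range(int(lo), int(hi or lo) + 1))
    return jobid, cores

jobid, cores = parse_joblist_entry("0-9/248.node00")
print(jobid, cores)  # 248.node00 occupies cores 0 through 9
```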
>
> Thanks for your discussion.
>
> /Guangping
>
>
> _______________________________________________
> torqueusers mailing list
> [email protected] <mailto:[email protected]>
> http://www.supercluster.org/mailman/listinfo/torqueusers
> <http://www.supercluster.org/mailman/listinfo/torqueusers>
>
> --
> David Beer | Torque Architect
> Adaptive Computing
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers