Have you tried limiting the number of Idle jobs allowed per user?
For instance:

    USERCFG[DEFAULT] MAXIJOB=10
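In maui.cfg that could look something like the sketch below (values are
illustrative, not tested here; MAXJOB and MAXPROC are optional companion
per-user caps on running jobs and processors):

    # maui.cfg -- per-user throttling (illustrative values)
    # MAXIJOB caps how many of a user's Idle jobs are eligible for
    # scheduling at a time; the excess stays deferred, so one user's
    # 4000-job dump no longer starves evaluation of the other queues.
    USERCFG[DEFAULT] MAXIJOB=10
    # Optional companion limits per user:
    # USERCFG[DEFAULT] MAXJOB=64 MAXPROC=128

Restart maui after editing maui.cfg so the change takes effect.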
r.

On Monday, 7 November 2011 at 23:44:47, Ian Miller wrote:
> Not sure if this is the correct forum for this, but:
> We have a 320-core grid running Maui & Torque. Three queues are set up:
> one queue has two nodes (24 cores), a second has two nodes exclusively,
> and three nodes share the default queue.
> When someone submits, say, 4000 jobs to the default queue, no one can
> submit any jobs to either of the other queues. They just sit in Q
> status. This started about three days ago and the users are totally in
> an uproar about it.
>
> Any thoughts on where to find the bottleneck or the config setting at
> fault would be helpful.
>
> -I
>
> Ian Miller
> System Administrator
> [email protected]
> 312-282-6507
>
>
> On 10/26/11 6:07 PM, "[email protected]" <[email protected]> wrote:
> >Hi Lance,
> >
> >Does maui locate appropriate nodes if you specify:
> >-l procs=24,vmem=29600mb
> >?
> >That's what I'd do. It will not limit the memory per process (loosely
> >speaking), but the main problem is which nodes are allocated.
> >
> >Gareth
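As concrete PBS directives, Gareth's suggestion would look something like
the sketch below (based on Lance's tmp.pbs quoted further down, and
untested here; note that vmem is a job-wide aggregate limit rather than a
per-process cap, which is the "loosely speaking" caveat):

    #!/bin/bash
    #PBS -S /bin/bash
    #PBS -l procs=24,vmem=29600mb
    #PBS -l walltime=6:00:00
    #PBS -j oe

    cat $PBS_NODEFILE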
> >> -----Original Message-----
> >> From: Lance Westerhoff [mailto:[email protected]]
> >> Sent: Thursday, 27 October 2011 2:31 AM
> >> To: [email protected]
> >> Subject: [Mauiusers] torque/maui disregarding pmem with procs
> >>
> >>
> >> Hello all-
> >>
> >> (I sent this email to the torque list, but I'm wondering if it might
> >> be a maui problem.)
> >>
> >> We are trying to use procs= and pmem= on an 18-node (152-core)
> >> cluster with nodes of various memory sizes. pbsnodes shows the
> >> correct memory complement for each node, so apparently PBS is
> >> getting the right specs (see the output of pbsnodes below for more
> >> information). If we use the following settings in the PBS script,
> >> torque/maui will invariably try to fill up all 8 cores of each node,
> >> even though there is nowhere near enough memory on any of these
> >> nodes for 8*3700mb = 29600mb. Considering the physical memory goes
> >> from 8GB to 24GB depending upon the node, this is just taking down
> >> nodes left and right.
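Spelling that arithmetic out against the pbsnodes figures quoted below:
if pmem=3700mb per process were actually honored at placement time, a
rough per-node packing limit would be (rounding down):

    8 procs * 3700mb = 29600mb on a fully packed node, versus:
      physmem=8177300kb  (~7.8GB)  -> room for at most 2 such procs
      physmem=12301956kb (~11.7GB) -> room for at most 3 such procs
      physmem=16438900kb (~15.7GB) -> room for at most 4 such procs
      physmem=24688212kb (~23.5GB) -> room for at most 6 such procs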
> >>
> >> Below I have provided a small example along with the associated
> >> output. I have also provided the output of pbsnodes in case there is
> >> something I am missing here.
> >>
> >> Thanks for your help! -Lance
> >>
> >> torque version: tried 2.5.4, 2.5.8, and 3.0.2 - all exhibit the same
> >> problem.
> >> maui version: 3.2.6p21 (also tried maui 3.3.1, but it is a complete
> >> fail in terms of the procs option: it only asks for a single CPU)
> >>
> >> $ cat tmp.pbs
> >> #!/bin/bash
> >> #PBS -S /bin/bash
> >> #PBS -l procs=24
> >> #PBS -l pmem=3700mb
> >> #PBS -l walltime=6:00:00
> >> #PBS -j oe
> >>
> >> cat $PBS_NODEFILE
> >>
> >> $ qsub tmp.pbs
> >> 337003.XXXX
> >> $ wc -l tmp.pbs.o337003
> >> 24 tmp.pbs.o337003
> >> $ cat tmp.pbs.o337003
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >>
> >> $ pbsnodes -a
> >> compute-0-16
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219085,varattr=,jobs=,state=free,netload=1834011936,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10095652kb,totmem=10225576kb,idletime=5582,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-16.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-15
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=700017694,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10150996kb,totmem=10225576kb,idletime=5606,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-15.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-14
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=1003164957,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10131180kb,totmem=10225576kb,idletime=5615,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-14.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-13
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=1173266470,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10132104kb,totmem=10225576kb,idletime=5637,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-13.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-12
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3991477,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276448kb,totmem=14350232kb,idletime=5604,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-12.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-11
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2947879,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14274604kb,totmem=14350232kb,idletime=5588,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-11.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-9
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3721396,gres=,loadave=0.05,ncpus=8,physmem=12301956kb,availmem=14253816kb,totmem=14350232kb,idletime=5660,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-9.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-8
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2934478,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254796kb,totmem=14350232kb,idletime=5675,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-8.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-7
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2909406,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254812kb,totmem=14350232kb,idletime=5489,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-7.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-6
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2936791,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14275644kb,totmem=14350232kb,idletime=5748,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-6.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-5
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2966183,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276260kb,totmem=14350232kb,idletime=5695,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-5.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-4
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2886627,gres=,loadave=0.00,ncpus=8,physmem=16438900kb,availmem=18412332kb,totmem=18487176kb,idletime=5634,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-4.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-3
> >>      state = free
> >>      np = 8
> >>      properties = lustre
> >>      ntype = cluster
> >>      status = rectime=1319219108,varattr=,jobs=,state=free,netload=436527254,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26636656kb,totmem=26736488kb,idletime=2224,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-3.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-2
> >>      state = free
> >>      np = 8
> >>      properties = lustre
> >>      ntype = cluster
> >>      status = rectime=1319219106,varattr=,jobs=,state=free,netload=1184385,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26659668kb,totmem=26736488kb,idletime=2223,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-2.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-1
> >>      state = free
> >>      np = 8
> >>      properties = lustre
> >>      ntype = cluster
> >>      status = rectime=1319219102,varattr=,jobs=,state=free,netload=1258074,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26657304kb,totmem=26736488kb,idletime=2228,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-1.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-0
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3416356,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26635624kb,totmem=26736488kb,idletime=5603,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-0.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-10
> >>      state = free
> >>      np = 2
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=283846193,gres=,loadave=0.23,ncpus=8,physmem=12301956kb,availmem=13762696kb,totmem=14350232kb,idletime=5622,nusers=1,nsessions=1,sessions=3410,uname=Linux compute-0-10.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-17
> >>      state = free
> >>      np = 8
> >>      properties = testbox
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2948331,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10144432kb,totmem=10225576kb,idletime=5558,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-17.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0

--
The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
Phone: +47 77 64 41 07, fax: +47 77 64 41 00.
Roy Dragseth, Team Leader, High Performance Computing.
Direct call: +47 77 64 62 56. Email: [email protected]

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
