Have you tried limiting the number of Idle jobs allowed per user?
For instance:

    USERCFG[DEFAULT] MAXIJOB=10
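In maui.cfg that could look something like the sketch below (values are
illustrative, not tested here; MAXJOB and MAXPROC are optional companion
per-user caps on running jobs and processors):

    # maui.cfg -- per-user throttling (illustrative values)
    # MAXIJOB caps how many of a user's Idle jobs are eligible for
    # scheduling at a time; the excess stays deferred, so one user's
    # 4000-job dump no longer starves evaluation of the other queues.
    USERCFG[DEFAULT] MAXIJOB=10
    # Optional companion limits per user:
    # USERCFG[DEFAULT] MAXJOB=64 MAXPROC=128

Restart maui after editing maui.cfg so the change takes effect.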
r.

On Monday, 7 November 2011 at 23:44:47, Ian Miller wrote:
> Not sure if this is the correct forum for this, but:
> We have a 320-core grid running Maui & Torque. Three queues are set up:
> one queue has two nodes (24 cores), a second has two nodes exclusively,
> and three nodes share the default queue.
> When someone submits, say, 4000 jobs to the default queue, no one can
> submit any jobs to either of the other queues. They just sit in Q
> status. This started about three days ago and the users are totally in
> an uproar about it.
>
> Any thoughts on where to find the bottleneck or the config setting at
> fault would be helpful.
>
> -I
>
> Ian Miller
> System Administrator
> [email protected]
> 312-282-6507
>
>
> On 10/26/11 6:07 PM, "[email protected]" <[email protected]> wrote:
> >Hi Lance,
> >
> >Does maui locate appropriate nodes if you specify:
> >-l procs=24,vmem=29600mb
> >?
> >That's what I'd do. It will not limit the memory per process (loosely
> >speaking), but the main problem is which nodes are allocated.
> >
> >Gareth
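As concrete PBS directives, Gareth's suggestion would look something like
the sketch below (based on Lance's tmp.pbs quoted further down, and
untested here; note that vmem is a job-wide aggregate limit rather than a
per-process cap, which is the "loosely speaking" caveat):

    #!/bin/bash
    #PBS -S /bin/bash
    #PBS -l procs=24,vmem=29600mb
    #PBS -l walltime=6:00:00
    #PBS -j oe

    cat $PBS_NODEFILE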
> >> -----Original Message-----
> >> From: Lance Westerhoff [mailto:[email protected]]
> >> Sent: Thursday, 27 October 2011 2:31 AM
> >> To: [email protected]
> >> Subject: [Mauiusers] torque/maui disregarding pmem with procs
> >>
> >>
> >> Hello all-
> >>
> >> (I sent this email to the torque list, but I'm wondering if it might
> >> be a maui problem.)
> >>
> >> We are trying to use procs= and pmem= on an 18-node (152-core)
> >> cluster with nodes of various memory sizes. pbsnodes shows the
> >> correct memory complement for each node, so apparently PBS is
> >> getting the right specs (see the output of pbsnodes below for more
> >> information). If we use the following settings in the PBS script,
> >> torque/maui will invariably try to fill up all 8 cores of each node,
> >> even though there is nowhere near enough memory on any of these
> >> nodes for 8*3700mb = 29600mb. Considering the physical memory goes
> >> from 8GB to 24GB depending upon the node, this is just taking down
> >> nodes left and right.
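Spelling that arithmetic out against the pbsnodes figures quoted below:
if pmem=3700mb per process were actually honored at placement time, a
rough per-node packing limit would be (rounding down):

    8 procs * 3700mb = 29600mb on a fully packed node, versus:
      physmem=8177300kb  (~7.8GB)  -> room for at most 2 such procs
      physmem=12301956kb (~11.7GB) -> room for at most 3 such procs
      physmem=16438900kb (~15.7GB) -> room for at most 4 such procs
      physmem=24688212kb (~23.5GB) -> room for at most 6 such procs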
> >>
> >> Below I have provided a small example along with the associated
> >> output. I have also provided the output of pbsnodes in case there is
> >> something I am missing here.
> >>
> >> Thanks for your help! -Lance
> >>
> >> torque version: tried 2.5.4, 2.5.8, and 3.0.2 - all exhibit the same
> >> problem.
> >> maui version: 3.2.6p21 (also tried maui 3.3.1, but it is a complete
> >> fail in terms of the procs option: it only asks for a single CPU)
> >>
> >> $ cat tmp.pbs
> >> #!/bin/bash
> >> #PBS -S /bin/bash
> >> #PBS -l procs=24
> >> #PBS -l pmem=3700mb
> >> #PBS -l walltime=6:00:00
> >> #PBS -j oe
> >>
> >> cat $PBS_NODEFILE
> >>
> >> $ qsub tmp.pbs
> >> 337003.XXXX
> >> $ wc -l tmp.pbs.o337003
> >> 24 tmp.pbs.o337003
> >> $ cat tmp.pbs.o337003
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-14
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-15
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >> compute-0-16
> >>
> >> $ pbsnodes -a
> >> compute-0-16
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219085,varattr=,jobs=,state=free,netload=1834011936,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10095652kb,totmem=10225576kb,idletime=5582,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-16.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-15
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=700017694,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10150996kb,totmem=10225576kb,idletime=5606,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-15.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-14
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=1003164957,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10131180kb,totmem=10225576kb,idletime=5615,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-14.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-13
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=1173266470,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10132104kb,totmem=10225576kb,idletime=5637,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-13.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-12
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3991477,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276448kb,totmem=14350232kb,idletime=5604,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-12.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-11
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2947879,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14274604kb,totmem=14350232kb,idletime=5588,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-11.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-9
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3721396,gres=,loadave=0.05,ncpus=8,physmem=12301956kb,availmem=14253816kb,totmem=14350232kb,idletime=5660,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-9.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-8
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2934478,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254796kb,totmem=14350232kb,idletime=5675,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-8.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-7
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2909406,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14254812kb,totmem=14350232kb,idletime=5489,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-7.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-6
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2936791,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14275644kb,totmem=14350232kb,idletime=5748,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-6.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-5
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2966183,gres=,loadave=0.00,ncpus=8,physmem=12301956kb,availmem=14276260kb,totmem=14350232kb,idletime=5695,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-5.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-4
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2886627,gres=,loadave=0.00,ncpus=8,physmem=16438900kb,availmem=18412332kb,totmem=18487176kb,idletime=5634,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-4.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-3
> >>      state = free
> >>      np = 8
> >>      properties = lustre
> >>      ntype = cluster
> >>      status = rectime=1319219108,varattr=,jobs=,state=free,netload=436527254,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26636656kb,totmem=26736488kb,idletime=2224,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-3.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-2
> >>      state = free
> >>      np = 8
> >>      properties = lustre
> >>      ntype = cluster
> >>      status = rectime=1319219106,varattr=,jobs=,state=free,netload=1184385,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26659668kb,totmem=26736488kb,idletime=2223,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-2.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-1
> >>      state = free
> >>      np = 8
> >>      properties = lustre
> >>      ntype = cluster
> >>      status = rectime=1319219102,varattr=,jobs=,state=free,netload=1258074,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26657304kb,totmem=26736488kb,idletime=2228,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-1.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-0
> >>      state = free
> >>      np = 8
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=3416356,gres=,loadave=0.00,ncpus=8,physmem=24688212kb,availmem=26635624kb,totmem=26736488kb,idletime=5603,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-0.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-10
> >>      state = free
> >>      np = 2
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=283846193,gres=,loadave=0.23,ncpus=8,physmem=12301956kb,availmem=13762696kb,totmem=14350232kb,idletime=5622,nusers=1,nsessions=1,sessions=3410,uname=Linux compute-0-10.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0
> >>
> >> compute-0-17
> >>      state = free
> >>      np = 8
> >>      properties = testbox
> >>      ntype = cluster
> >>      status = rectime=1319219090,varattr=,jobs=,state=free,netload=2948331,gres=,loadave=0.00,ncpus=8,physmem=8177300kb,availmem=10144432kb,totmem=10225576kb,idletime=5558,nusers=0,nsessions=? 0,sessions=? 0,uname=Linux compute-0-17.local 2.6.18-274.7.1.el5 #1 SMP Thu Oct 20 16:21:01 EDT 2011 x86_64,opsys=linux
> >>      gpus = 0

--
The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
Phone: +47 77 64 41 07, fax: +47 77 64 41 00.
Roy Dragseth, Team Leader, High Performance Computing.
Direct call: +47 77 64 62 56. Email: [email protected]

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
