Have you tried to recompile maui with larger limits?

sed -i -e "/MAX_MRES/ s/1024/8192/g" include/moab.h
sed -i -e "/MMAX_JOB/ s/4096/8192/g" ./include/msched.h
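If you would rather see which lines such a substitution touches before editing the headers, something like the rough Python sketch below could serve as a dry run. It only mimics the sed address-plus-substitute behaviour; the header excerpt in it is made up for illustration and is not copied from the Maui source, so the actual define names and values in your tree may differ.

```python
import re


def preview_sed_substitution(text, address, old, new):
    """Mimic `sed -e "/address/ s/old/new/g"` without modifying anything.

    Returns a list of (line_number, before, after) for every line that
    matches `address` and would actually be changed by the substitution.
    """
    changes = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # The address restricts the substitution to matching lines only.
        if re.search(address, line):
            replaced = re.sub(old, new, line)
            if replaced != line:
                changes.append((line_no, line, replaced))
    return changes


# Hypothetical header excerpt, NOT taken from the real Maui headers.
sample_header = """\
#define MAX_MRES      1024
#define MAX_MNODE     1024
#define MMAX_JOB      4096
"""

for line_no, before, after in preview_sed_substitution(
    sample_header, "MAX_MRES", "1024", "8192"
):
    print(f"line {line_no}: {before!r} -> {after!r}")
```

Running the function on the real header contents (read with `open(...).read()`) would show whether the pattern matches anything at all, which is worth checking since a sed command that matches nothing fails silently.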
There might be others that need to be increased too.

r.

On Thursday, January 10, 2013 14:42:51 Mario Kadastik wrote:
> Hi,
>
> this is a constant issue we have. Maui is unable to schedule all jobs, but
> there doesn't seem to be a fixed amount, but it varies. Sometimes the
> running peaks at 3200 sometimes 3700 sometimes 3900 sometimes 4100, no
> correlation found yet. A usual situation:
>
> [root@torque-v-1 ~]# qstat -q
>
> server: torque-v-1.local
>
> Queue            Memory CPU Time Walltime Node   Run   Que Lm  State
> ---------------- ------ -------- -------- ---- ----- ----- --  -----
> test               --   01:00:00 02:00:00  --      0     0 --   E R
> long               --   48:00:00 72:00:00  --   3249   756 --   E R
> short              --   01:00:00 02:00:00  --      0     0 --   E R
>                                                ----- -----
>                                                 3249   756
>
> [root@torque-v-1 ~]# diagnose -t
> DEFAULT [test 4122:4122]
>
> [root@torque-v-1 ~]# pbsnodes -l free|wc -l
> 120
>
> So as you can see there are more free cores than queued jobs. All our jobs
> are single core jobs with no requirements that would prohibit running
> (defaults are only used, Grid doesn't specify job requirements that would
> conflict).
>
> The main reason seems to be this:
>
> 01/10 14:36:20 MPBSWorkloadQuery(base,JCount,SC)
> 01/10 14:36:20 INFO: job '2081730' changed states from Running to Hold
> 01/10 14:36:20 INFO: job '2081809' changed states from Running to Hold
> 01/10 14:36:20 INFO: job '2081810' changed states from Running to Hold
> 01/10 14:36:29 INFO: 3916 PBS jobs detected on RM base
> 01/10 14:36:29 INFO: jobs detected: 3916
> 01/10 14:36:30 INFO: total jobs selected (ALL): 647/3916 [State: 3269]
> 01/10 14:36:30 INFO: total jobs selected (ALL): 647/3916 [State: 3269]
> 01/10 14:36:30 INFO: total jobs selected in partition ALL: 647/647
> 01/10 14:36:30 INFO: total jobs selected in partition ALL: 647/647
> 01/10 14:36:30 INFO: total jobs selected in partition DEFAULT: 647/647
> 01/10 14:36:30 MRMJobStart(2081811,Msg,SC)
> 01/10 14:36:30 MPBSJobStart(2081811,base,Msg,SC)
> 01/10 14:36:30 MPBSJobModify(2081811,Resource_List,Resource,wn-v-2936.local)
> 01/10 14:36:30 MPBSJobModify(2081811,Resource_List,Resource,1)
> 01/10 14:36:30 INFO: job '2081811' successfully started
> 01/10 14:36:30 MRMJobStart(2081735,Msg,SC)
> 01/10 14:36:30 MPBSJobStart(2081735,base,Msg,SC)
> 01/10 14:36:30 MPBSJobModify(2081735,Resource_List,Resource,wn-v-4556.local)
> 01/10 14:36:30 MPBSJobModify(2081735,Resource_List,Resource,1)
> 01/10 14:36:30 INFO: job '2081735' successfully started
> 01/10 14:36:30 ERROR: cannot create reservation for job '2081735'
> 01/10 14:36:30 ERROR: cannot start job '2081735' in partition DEFAULT
> 01/10 14:36:30 MJobPReserve(2081735,DEFAULT,ResCount,ResCountRej)
> 01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
> 01/10 14:36:30 MJobPReserve(2081736,DEFAULT,ResCount,ResCountRej)
> 01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
> 01/10 14:36:30 MJobPReserve(2081815,DEFAULT,ResCount,ResCountRej)
> 01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
> 01/10 14:36:30 MJobPReserve(2081738,DEFAULT,ResCount,ResCountRej)
> 01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
> ...
>
> This message of cannot create reservation follows in hundreds and then the
> whole scheduling restarts for the next cycle. As you can see it was able to
> start two jobs, but I assume those were the ones that had finished recently
> and then it filled the slots. We've not been able to figure out what causes
> this. Any ideas how to debug this would be welcome. If we force a job to
> run it'll run, but maui itself won't run them. The level at which it gets
> to this state varies as I mentioned, we've even seen once it almost fill
> the whole cluster.
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not
> why we do it" -- Richard P. Feynman
>
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers

-- 
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: [email protected]
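
To see whether the number of "cannot create reservation" alerts tracks the point at which scheduling stalls, the quoted log could be tallied per cycle. The Python sketch below is only a starting point: it assumes the message wording seen in the excerpt above (other Maui versions or log levels may phrase things differently), and the function name `summarize_cycle` is made up for this example.

```python
import re


def summarize_cycle(log_text):
    """Tally a few message types from one Maui scheduling cycle.

    The patterns are assumptions based solely on the quoted log excerpt.
    """
    summary = {
        "detected": None,          # jobs reported by the resource manager
        "started": 0,              # "successfully started" messages
        "reservation_alerts": 0,   # ALERT lines about failed reservations
        "reserve_attempts": [],    # job ids passed to MJobPReserve
    }
    for line in log_text.splitlines():
        detected = re.search(r"INFO:\s+(\d+) PBS jobs detected", line)
        if detected:
            summary["detected"] = int(detected.group(1))
        if "successfully started" in line:
            summary["started"] += 1
        if "ALERT:" in line and "cannot create reservation" in line:
            summary["reservation_alerts"] += 1
        attempt = re.search(r"MJobPReserve\((\d+),", line)
        if attempt:
            summary["reserve_attempts"].append(attempt.group(1))
    return summary


# A few lines taken from the log quoted in this thread.
sample_log = """\
01/10 14:36:29 INFO: 3916 PBS jobs detected on RM base
01/10 14:36:30 INFO: job '2081811' successfully started
01/10 14:36:30 INFO: job '2081735' successfully started
01/10 14:36:30 ERROR: cannot create reservation for job '2081735'
01/10 14:36:30 MJobPReserve(2081735,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
01/10 14:36:30 MJobPReserve(2081736,DEFAULT,ResCount,ResCountRej)
01/10 14:36:30 ALERT: cannot create reservation in MJobReserve
"""

print(summarize_cycle(sample_log))
```

Comparing the `detected` count against the point where the alerts begin, over several cycles, might show whether the failures start near a fixed total, which would be consistent with a compiled-in table limit.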
