Have you tried recompiling Maui with larger limits?

sed -i -e "/MAX_MRES/ s/1024/8192/g" include/moab.h
sed -i -e "/MMAX_JOB/ s/4096/8192/g" include/msched.h

There might be others that need to be increased too.
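
It may be worth checking that those patterns actually match something in your source tree before and after running sed. A minimal sketch, assuming you run it from the top of the Maui source directory; show_limits is just a hypothetical helper name, not something that ships with Maui, and the default values in your version may differ from the ones in the sed lines above:

```shell
# Hypothetical helper (not part of Maui): print every #define line in a
# header whose macro name contains the given string, with line numbers,
# so the current value can be checked by eye.
show_limits() {
    header=$1
    name=$2
    grep -n -E "^[[:space:]]*#[[:space:]]*define[[:space:]]+[A-Za-z_]*${name}" "$header"
}

# Usage, before and after the sed edits:
# show_limits include/moab.h MAX_MRES
# show_limits include/msched.h MMAX_JOB
```

If a call prints nothing, the define is not in that header (or is spelled differently) and the corresponding sed line would have changed nothing. The new values only take effect after Maui is rebuilt and reinstalled.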

r.

On Thursday, January 10, 2013 14:42:51 Mario Kadastik wrote:
> Hi,
> 
> this is a constant issue for us. Maui is unable to schedule all jobs, and the
> point at which it stops is not fixed: the number of running jobs sometimes
> peaks at 3200, sometimes at 3700, 3900, or 4100, with no correlation found
> yet. A typical situation:
> 
> [root@torque-v-1 ~]# qstat -q
> 
> server: torque-v-1.local
> 
> Queue            Memory CPU Time Walltime Node  Run Que Lm  State
> ---------------- ------ -------- -------- ----  --- --- --  -----
> test               --   01:00:00 02:00:00   --    0   0 --   E R
> long               --   48:00:00 72:00:00   --  3249 756 --   E R
> short              --   01:00:00 02:00:00   --    0   0 --   E R
>                                                ----- -----
>                                                 3249   756
> 
> [root@torque-v-1 ~]# diagnose -t
>      DEFAULT [test 4122:4122]
> 
> [root@torque-v-1 ~]# pbsnodes -l free|wc -l
> 120
> 
> So as you can see, there are more free cores than queued jobs. All our jobs
> are single-core jobs with no requirements that would prohibit running
> (only defaults are used; the Grid doesn't specify job requirements that
> would conflict).
> 
> The main reason seems to be this:
> 
> 01/10 14:36:20 MPBSWorkloadQuery(base,JCount,SC)
> 01/10 14:36:20 INFO:     job '2081730' changed states from Running to Hold
> 01/10 14:36:20 INFO:     job '2081809' changed states from Running to Hold
> 01/10 14:36:20 INFO:     job '2081810' changed states from Running to Hold
> 01/10 14:36:29 INFO:     3916 PBS jobs detected on RM base
> 01/10 14:36:29 INFO:     jobs detected: 3916
> 01/10 14:36:30 INFO:     total jobs selected (ALL): 647/3916 [State: 3269]
> 01/10 14:36:30 INFO:     total jobs selected (ALL): 647/3916 [State: 3269]
> 01/10 14:36:30 INFO:     total jobs selected in partition ALL: 647/647
> 01/10 14:36:30 INFO:     total jobs selected in partition ALL: 647/647
> 01/10 14:36:30 INFO:     total jobs selected in partition DEFAULT: 647/647
> 01/10 14:36:30 MRMJobStart(2081811,Msg,SC)
> 01/10 14:36:30 MPBSJobStart(2081811,base,Msg,SC)
> 01/10 14:36:30 MPBSJobModify(2081811,Resource_List,Resource,wn-v-2936.local)
> 01/10 14:36:30 MPBSJobModify(2081811,Resource_List,Resource,1)
> 01/10 14:36:30 INFO:     job '2081811' successfully started
> 01/10 14:36:30 MRMJobStart(2081735,Msg,SC)
> 01/10 14:36:30 MPBSJobStart(2081735,base,Msg,SC)
> 01/10 14:36:30 MPBSJobModify(2081735,Resource_List,Resource,wn-v-4556.local)
> 01/10 14:36:30 MPBSJobModify(2081735,Resource_List,Resource,1)
> 01/10 14:36:30 INFO:     job '2081735' successfully started
> 01/10 14:36:30 ERROR:    cannot create reservation for job '2081735'
> 01/10 14:36:30 ERROR:    cannot start job '2081735' in partition DEFAULT
> 01/10 14:36:30 MJobPReserve(2081735,DEFAULT,ResCount,ResCountRej)
> 01/10 14:36:30 ALERT:    cannot create reservation in MJobReserve
> 01/10 14:36:30 MJobPReserve(2081736,DEFAULT,ResCount,ResCountRej)
> 01/10 14:36:30 ALERT:    cannot create reservation in MJobReserve
> 01/10 14:36:30 MJobPReserve(2081815,DEFAULT,ResCount,ResCountRej)
> 01/10 14:36:30 ALERT:    cannot create reservation in MJobReserve
> 01/10 14:36:30 MJobPReserve(2081738,DEFAULT,ResCount,ResCountRej)
> 01/10 14:36:30 ALERT:    cannot create reservation in MJobReserve
> ...
> 
> This "cannot create reservation" message repeats hundreds of times, and then
> the whole scheduling restarts for the next cycle. As you can see, it was able
> to start two jobs, but I assume those filled slots freed by jobs that had
> finished recently. We've not been able to figure out what causes this. Any
> ideas on how to debug this would be welcome. If we force a job to run, it
> runs, but Maui itself won't start them. The level at which it gets into this
> state varies, as I mentioned; we once even saw it almost fill the whole
> cluster.
> 
> Mario Kadastik, PhD
> Researcher
> 
> ---
>   "Physics is like sex, sure it may have practical reasons, but that's not
> why we do it" -- Richard P. Feynman
> 
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
              phone:+47 77 64 41 07, fax:+47 77 64 41 00
        Roy Dragseth, Team Leader, High Performance Computing
         Direct call: +47 77 64 62 56. email: [email protected]

