Hi,

I'm having trouble with Maui from the EMI-1 repository. It schedules jobs only 
up to a certain number and then stops scheduling more, even though there are 
free slots. The Maui log shows that it tries to schedule jobs but fails to 
create reservations:

10/31 19:49:45 INFO:     162 PBS resources detected on RM base
10/31 19:49:45 INFO:     resources detected: 162
10/31 19:49:45 MPBSWorkloadQuery(base,JCount,SC)
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046246' loaded:   1   cms225      cms 259200       Idle   0 1351705749   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046247' loaded:   1   cms225      cms 259200       Idle   0 1351705750   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046248' loaded:   1   cms225      cms 259200       Idle   0 1351705752   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046249' loaded:   1   cms225      cms 259200       Idle   0 1351705756   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     processing node request line '1'
10/31 19:50:06 INFO:     job '1046250' loaded:   1   cms225      cms 259200       Idle   0 1351705770   [NONE] [NONE] [NONE] >=      0 >=      0 [longqueue] 1351705785
10/31 19:50:06 INFO:     active PBS job 1041018 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1041187 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1044863 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1044890 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1044916 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     active PBS job 1045212 has been removed from the queue.  assuming successful completion
10/31 19:50:06 INFO:     4982 PBS jobs detected on RM base
10/31 19:50:06 INFO:     jobs detected: 4982
10/31 19:50:07 INFO:     total jobs selected (ALL): 848/4982 [State: 4134]
10/31 19:50:07 INFO:     total jobs selected (ALL): 848/4982 [State: 4134]
10/31 19:50:07 INFO:     total jobs selected in partition ALL: 848/848 
10/31 19:50:07 INFO:     total jobs selected in partition ALL: 848/848 
10/31 19:50:07 INFO:     total jobs selected in partition DEFAULT: 848/848 
10/31 19:50:07 MRMJobStart(1045241,Msg,SC)
10/31 19:50:07 MPBSJobStart(1045241,base,Msg,SC)
10/31 19:50:07 MPBSJobModify(1045241,Resource_List,Resource,wn-v-4196.local)
10/31 19:50:07 MPBSJobModify(1045241,Resource_List,Resource,1)
10/31 19:50:07 INFO:     job '1045241' successfully started
10/31 19:50:07 MRMJobStart(1045242,Msg,SC)
10/31 19:50:07 MPBSJobStart(1045242,base,Msg,SC)
10/31 19:50:07 MPBSJobModify(1045242,Resource_List,Resource,wn-v-6068.local)
10/31 19:50:07 MPBSJobModify(1045242,Resource_List,Resource,1)
10/31 19:50:07 INFO:     job '1045242' successfully started
10/31 19:50:07 ERROR:    cannot create reservation for job '1045242'
10/31 19:50:07 ERROR:    cannot start job '1045242' in partition DEFAULT
10/31 19:50:07 MJobPReserve(1045242,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045243,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045244,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045245,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045247,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045246,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
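
For anyone who wants to tally these messages over a longer stretch of the log, 
here is a minimal Python sketch. It assumes the message wording is exactly as 
in the excerpt above; the function name summarize and the sample string are my 
own, not part of Maui. It counts successful starts, reservation errors and 
MJobReserve alerts, and lists jobs that are reported as started and then hit a 
reservation error in the same excerpt (job 1045242 above).

```python
import re

# Sample lines copied from the excerpt above; in practice, read the Maui log file.
SAMPLE_LOG = """\
10/31 19:50:07 INFO:     job '1045241' successfully started
10/31 19:50:07 INFO:     job '1045242' successfully started
10/31 19:50:07 ERROR:    cannot create reservation for job '1045242'
10/31 19:50:07 ERROR:    cannot start job '1045242' in partition DEFAULT
10/31 19:50:07 MJobPReserve(1045242,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045243,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT:    cannot create reservation in MJobReserve
"""

# Patterns are assumptions based only on the message text shown in this mail.
STARTED = re.compile(r"job '(\d+)' successfully started")
RES_FAILED = re.compile(r"cannot create reservation for job '(\d+)'")
PRESERVE = re.compile(r"MJobPReserve\((\d+),")
ALERT_TEXT = "cannot create reservation in MJobReserve"


def summarize(lines):
    """Count start and reservation-failure messages in an iterable of log lines."""
    started = []
    res_failed = []
    preserve_calls = 0
    alerts = 0
    for line in lines:
        match = STARTED.search(line)
        if match:
            started.append(match.group(1))
            continue
        match = RES_FAILED.search(line)
        if match:
            res_failed.append(match.group(1))
            continue
        if PRESERVE.search(line):
            preserve_calls += 1
        elif ALERT_TEXT in line:
            alerts += 1
    return {
        "started": len(started),
        "reservation_errors": len(res_failed),
        "preserve_calls": preserve_calls,
        "alerts": alerts,
        # Jobs reported as started that also logged a reservation error.
        "started_then_failed": sorted(set(started) & set(res_failed)),
    }


summary = summarize(SAMPLE_LOG.splitlines())
print(summary)
```

To run it on a real log, pass an open file object to summarize instead of the 
sample lines.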

The queues show this:
[root@torque-v-1 log]# qstat -q

server: torque-v-1.local

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
test               --   01:00:00 02:00:00   --    0   0 --   E R
long               --   48:00:00 72:00:00   --  4101 974 --   E R
short              --   01:00:00 02:00:00   --    2   0 --   E R
                                              ----- -----
                                               4103   974
[root@torque-v-1 log]# 
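
To track the running and queued counts over time, the qstat -q output can be 
parsed with a small script. This is only a sketch: it assumes the column layout 
is exactly as shown above (Run and Que as the sixth and seventh whitespace-
separated fields), and the function name parse_qstat_q is hypothetical.

```python
# Output copied from the qstat -q call above; in practice, capture it with subprocess.
SAMPLE_QSTAT = """\
server: torque-v-1.local

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
test               --   01:00:00 02:00:00   --    0   0 --   E R
long               --   48:00:00 72:00:00   --  4101 974 --   E R
short              --   01:00:00 02:00:00   --    2   0 --   E R
                                              ----- -----
                                               4103   974
"""


def parse_qstat_q(text):
    """Return {queue_name: (running, queued)} from qstat -q style output."""
    queues = {}
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            if in_table:
                break  # second dashed line precedes the totals row
            in_table = True  # first dashed line follows the column header
            continue
        if in_table and stripped:
            fields = stripped.split()
            # Assumed layout: name, memory, cpu time, walltime, node, run, que, ...
            queues[fields[0]] = (int(fields[5]), int(fields[6]))
    return queues


queues = parse_qstat_q(SAMPLE_QSTAT)
total_running = sum(run for run, _ in queues.values())
total_queued = sum(que for _, que in queues.values())
print(queues, total_running, total_queued)
```

For the output above, the per-queue sums agree with the totals row printed by 
qstat (4103 running, 974 queued).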

However, there are free slots:
[root@torque-v-1 log]# diagnose -t
    DEFAULT [test 5427:5427]

All slots are configured for the short and long queues (why they don't show up 
in diagnose -t is beyond me). I have seen scheduling get stuck at around 
3500-3700 running jobs; now, after a maintenance downtime during which the job 
count dropped to 0, the limit seems to be around 4100-4300 jobs. I did see 4930 
running jobs a while ago, but that has not been possible recently. Ideas are 
welcome.

The installed Maui packages are:
[root@torque-v-1 log]# rpm -qa|grep maui
maui-3.2.6p21-snap.1234905291.5.el5
maui-client-3.2.6p21-snap.1234905291.5.el5
maui-server-3.2.6p21-snap.1234905291.5.el5

PS: If you received this twice, sorry; I wasn't sure my original mail got 
through.

Thanks in advance, 

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman
