Hi,
I'm having trouble with Maui from the EMI-1 repository. It schedules jobs only
up to a certain number and then stops scheduling more, even though there are
free slots. The Maui log shows that it tries to schedule jobs but fails to
create reservations:
10/31 19:49:45 INFO: 162 PBS resources detected on RM base
10/31 19:49:45 INFO: resources detected: 162
10/31 19:49:45 MPBSWorkloadQuery(base,JCount,SC)
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046246' loaded: 1 cms225 cms 259200 Idle 0 1351705749 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046247' loaded: 1 cms225 cms 259200 Idle 0 1351705750 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046248' loaded: 1 cms225 cms 259200 Idle 0 1351705752 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046249' loaded: 1 cms225 cms 259200 Idle 0 1351705756 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: processing node request line '1'
10/31 19:50:06 INFO: job '1046250' loaded: 1 cms225 cms 259200 Idle 0 1351705770 [NONE] [NONE] [NONE] >= 0 >= 0 [longqueue] 1351705785
10/31 19:50:06 INFO: active PBS job 1041018 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1041187 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1044863 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1044890 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1044916 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: active PBS job 1045212 has been removed from the queue. assuming successful completion
10/31 19:50:06 INFO: 4982 PBS jobs detected on RM base
10/31 19:50:06 INFO: jobs detected: 4982
10/31 19:50:07 INFO: total jobs selected (ALL): 848/4982 [State: 4134]
10/31 19:50:07 INFO: total jobs selected (ALL): 848/4982 [State: 4134]
10/31 19:50:07 INFO: total jobs selected in partition ALL: 848/848
10/31 19:50:07 INFO: total jobs selected in partition ALL: 848/848
10/31 19:50:07 INFO: total jobs selected in partition DEFAULT: 848/848
10/31 19:50:07 MRMJobStart(1045241,Msg,SC)
10/31 19:50:07 MPBSJobStart(1045241,base,Msg,SC)
10/31 19:50:07 MPBSJobModify(1045241,Resource_List,Resource,wn-v-4196.local)
10/31 19:50:07 MPBSJobModify(1045241,Resource_List,Resource,1)
10/31 19:50:07 INFO: job '1045241' successfully started
10/31 19:50:07 MRMJobStart(1045242,Msg,SC)
10/31 19:50:07 MPBSJobStart(1045242,base,Msg,SC)
10/31 19:50:07 MPBSJobModify(1045242,Resource_List,Resource,wn-v-6068.local)
10/31 19:50:07 MPBSJobModify(1045242,Resource_List,Resource,1)
10/31 19:50:07 INFO: job '1045242' successfully started
10/31 19:50:07 ERROR: cannot create reservation for job '1045242'
10/31 19:50:07 ERROR: cannot start job '1045242' in partition DEFAULT
10/31 19:50:07 MJobPReserve(1045242,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045243,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045244,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045245,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045247,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
10/31 19:50:07 MJobPReserve(1045246,DEFAULT,ResCount,ResCountRej)
10/31 19:50:07 ALERT: cannot create reservation in MJobReserve
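To quantify how many jobs hit this failure in one scheduling pass, a minimal Python sketch along the following lines could tally them. The function name and both regular expressions are illustrative assumptions based only on the log lines quoted above, not on any documented Maui log format.

```python
import re

# Both patterns are assumptions derived from the log excerpt above.
RESERVE_CALL = re.compile(r"MJobPReserve\((\d+),")
RESERVE_FAIL = re.compile(r"ALERT:\s+cannot create reservation in MJobReserve")


def failed_reservations(log_text):
    """Return job IDs whose MJobPReserve line is directly followed by the ALERT."""
    failed = []
    last_job = None  # job ID from the previous line, if it was an MJobPReserve call
    for line in log_text.splitlines():
        call = RESERVE_CALL.search(line)
        if call:
            last_job = call.group(1)
            continue
        if RESERVE_FAIL.search(line) and last_job is not None:
            failed.append(last_job)
        # Any other line breaks the call/alert pairing.
        last_job = None
    return failed
```

Feeding it the contents of the Maui log file would return a list of job IDs as strings, in the order the alerts appear.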
The queues show this:
[root@torque-v-1 log]# qstat -q
server: torque-v-1.local
Queue            Memory CPU Time Walltime Node   Run   Que Lm  State
---------------- ------ -------- -------- ---- ----- ----- --  -----
test               --   01:00:00 02:00:00  --      0     0 --   E R
long               --   48:00:00 72:00:00  --   4101   974 --   E R
short              --   01:00:00 02:00:00  --      2     0 --   E R
                                               ----- -----
                                                4103   974
[root@torque-v-1 log]#
There are free slots however:
[root@torque-v-1 log]# diagnose -t
DEFAULT [test 5427:5427]
All slots are configured for the short and long queues (why they don't show up
in diagnose -t is beyond me, but ...). Ideas are welcome. I've seen scheduling
get stuck at around 3500-3700 running jobs; now, after a maintenance downtime
during which the job count reached 0, the limit seems to be around 4100-4300
jobs. I did see 4930 running jobs a while ago, but that hasn't been possible
recently.
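To put a rough number on the gap, here is a small sketch of the arithmetic. It assumes that 5427 in the diagnose -t output is the total number of configured slots and that every running job occupies exactly one slot; both are assumptions on my part, and the helper name is made up for illustration.

```python
def idle_slots(total_slots, running_jobs, slots_per_job=1):
    """Slots left unused, assuming each running job uses slots_per_job slots."""
    return total_slots - running_jobs * slots_per_job


total_slots = 5427   # assumed meaning of "DEFAULT [test 5427:5427]"
running_jobs = 4103  # "Run" total from qstat -q
queued_jobs = 974    # "Que" total from qstat -q

free = idle_slots(total_slots, running_jobs)
# Jobs that could start right now if nothing else were blocking them.
startable = min(free, queued_jobs)
print(free, startable)  # 1324 974
```

Under those assumptions, all 974 queued jobs would fit into the idle slots, which is why the reservation failures look wrong to me.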
The installed Maui packages are:
[root@torque-v-1 log]# rpm -qa|grep maui
maui-3.2.6p21-snap.1234905291.5.el5
maui-client-3.2.6p21-snap.1234905291.5.el5
maui-server-3.2.6p21-snap.1234905291.5.el5
PS: If you received this twice, sorry; I wasn't sure my original mail got
through.
Thanks in advance,
Mario Kadastik, PhD
Researcher
---
"Physics is like sex, sure it may have practical reasons, but that's not why
we do it"
-- Richard P. Feynman
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers