Hi,
I have installed a mini test cluster with torque and maui. We have used
maui/torque for years on our grid cluster and are now upgrading to
torque 2.5.7 and maui 3.3-4. Unfortunately, with this new combination
maui doesn't seem to work correctly: when I submit jobs, it behaves
as if there weren't any free resources. Even when I installed only
torque and maui with a bare-minimum configuration, I got the same
behaviour, i.e.:
1) When I submit jobs, they just remain queued:
    [root@<server> maui]# qstat -an1

    <server>:
                                                                       Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
    -------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
    10.<server>          aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --  --
    11.s<server>         aforti   long     pbs-vm3.sh           --    --  --     --    -- Q    --  --
2) If I run qrun <jobid>, the job runs, so I assume the problem is not
between the torque server and the torque mom.
3) On the old versions, showq displayed the WCLimit of the default
queue. Now it displays 0 at first and then changes by itself to
100 days:
    [root@<server> maui]# showq
    ACTIVE JOBS--------------------
    JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME


         0 Active Jobs       0 of   16 Processors Active (0.00%)
                             0 of    1 Nodes Active      (0.00%)

    IDLE JOBS----------------------
    JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

    10                   aforti       Idle     1 99:23:59:59  Tue Oct  9 15:32:13
    11                   aforti       Idle     1 99:23:59:59  Tue Oct  9 16:39:09

    2 Idle Jobs

    BLOCKED JOBS----------------
    JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


    Total Jobs: 2   Active Jobs: 0   Idle Jobs: 2   Blocked Jobs: 0
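For what it's worth, 99:23:59:59 read as DD:HH:MM:SS is one second short of 100 days. A minimal Python sketch of that arithmetic, assuming showq prints limits as [[DD:]HH:]MM:SS (the helper name wclimit_to_seconds is made up for illustration and is not part of Maui):

```python
def wclimit_to_seconds(wclimit):
    """Convert a limit written as [[DD:]HH:]MM:SS to seconds.

    The field layout is an assumption based on the showq output above.
    """
    parts = [int(p) for p in wclimit.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("unexpected time format: %r" % wclimit)
    # Pad on the left so the fields are always days, hours, minutes, seconds.
    days, hours, minutes, seconds = [0] * (4 - len(parts)) + parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


limit = wclimit_to_seconds("99:23:59:59")
print(limit)                    # 8639999
print(100 * 24 * 3600 - limit)  # 1
```

The same helper gives 259200 seconds for the queue's resources_max.walltime of 72:00:00, so the limit showq reports is not the one configured on the queue.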
4) checkjob <jobid> just tells me the job cannot be run in the default
partition, without giving any particular reason:
    [.....]
    PE: 1.00  StartPriority: 120
    cannot select job 10 for partition DEFAULT (Class)
5) checknode shows the node as free, in case that wasn't clear from the
other commands:
    [root@<server> maui]# !checkno
    checknode <node>

    checking node <node>

    State:      Idle  (in current state for 00:55:10)
    Configured Resources: PROCS: 16  MEM: 23G  SWAP: 31G  DISK: 1M
    Utilized   Resources: SWAP: 202M
    Dedicated  Resources: [NONE]
    Opsys:      linux     Arch: [NONE]
    Speed:      1.00      Load: 0.000
    Network:    [DEFAULT]
    Features:   [lcgpro]
    Attributes: [Batch]
    Classes:    [DEFAULT 1:1]

    Total Time: 3:06:35  Up: 3:06:24 (99.90%)  Active: 00:00:10 (0.09%)

    Reservations:
    NOTE:  no reservations on node
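One thing that can be checked mechanically from the output above is whether the job's queue shows up on the node's Classes line. Here the jobs sit in queue long, while checknode lists only DEFAULT. Whether that mismatch is what the "(Class)" message refers to is an assumption, not something the logs confirm. A minimal Python sketch of the comparison, assuming class entries look like [NAME avail:configured] (node_classes is a hypothetical helper, not a Maui tool):

```python
import re

CHECKNODE_OUTPUT = """\
State:      Idle  (in current state for 00:55:10)
Features:   [lcgpro]
Attributes: [Batch]
Classes:    [DEFAULT 1:1]
"""


def node_classes(checknode_output):
    """Return the class names found on the 'Classes:' line, or [] if absent."""
    for line in checknode_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Classes:"):
            # Assumed entry layout: "[NAME avail:configured]"; keep NAME only.
            return re.findall(r"\[(\w+)[^\]]*\]", stripped)
    return []


job_queue = "long"  # queue shown by qstat for jobs 10 and 11
classes = node_classes(CHECKNODE_OUTPUT)
print(classes)
print(job_queue in classes)
```

With the output pasted above, the queue name is not among the node's classes.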
6) showbf -v, however, says my nodes are blocked by reservations,
despite checknode clearly telling me there are no reservations on that
node. Our local maui.cfg has a reservation for 1 proc; I'm not sure why
it blocks the whole node:
    [root@<server2> server_logs]# showbf -v
    backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:08:59

      3 procs available with no timelimit

    node <node2> is blocked by reservation sft.0.0 in INFINITY
To be sure, I removed the reservation, but even with it gone and
maui.cfg reduced to the default version with nothing extra in it, showbf
tells me the node is blocked by "reservation NONE in INFINITY":
    [root@<server> maui]# showbf -v
    backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct  9 17:37:58

     16 procs available with no timelimit

    node <node> is blocked by reservation NONE in INFINITY
I'm not sure how to proceed, because the log files don't tell me
anything and all the references I have found to similar problems have
remained unanswered.
Thanks for any help. Here are the RPMs I used:
    maui-3.3-4.el5
    maui-client-3.3-4.el5
    maui-server-3.3-4.el5
    torque-2.5.7-7.el5
    torque-client-2.5.7-7.el5
    torque-server-2.5.7-7.el5
    libtorque-2.5.7-7.el5
The maui.cfg:
    #
    # MAUI configuration example
    # @(#)maui.cfg David Groep 20031015.1
    # for MAUI version 3.2.5
    #
    SERVERHOST <server>
    ADMIN1 root
    ADMINHOST <server>
    RMTYPE[0] PBS
    RMHOST[0] <server>
    RMSERVER[0] <server>

    SERVERPORT 40559
    SERVERMODE NORMAL
    # Set PBS server polling interval. Since we have many short jobs
    # and want fast turn-around, set this to 10 seconds (default: 2 minutes)
    RMPOLLINTERVAL 00:00:10
    # a max. 10 MByte log file in a logical location
    LOGFILE /var/log/maui.log
    LOGFILEMAXSIZE 10000000
    LOGLEVEL 3
And the Torque config:
    create queue long
    set queue long queue_type = Execution
    set queue long acl_hosts = localhost
    set queue long acl_hosts += <server>
    set queue long resources_max.cput = 48:00:00
    set queue long resources_max.walltime = 72:00:00
    set queue long acl_group_enable = True
    set queue long acl_groups = aforti
    set queue long enabled = True
    set queue long started = True
    #
    # Set server attributes.
    #
    set server scheduling = True
    set server acl_host_enable = False
    set server acl_hosts = <server>
    set server acl_hosts += localhost
    set server default_queue = long
    set server log_events = 511
    set server mail_from = adm
    set server next_job_number = 12
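To rule out a simple inconsistency on the Torque side, the settings above can be read back mechanically. This is a minimal Python sketch that only understands the "create queue" and "set queue/server ... = / +=" lines shown here; parse_qmgr and SET_RE are hypothetical names for illustration, not part of Torque, and the embedded text is a shortened copy of the config above:

```python
import re

QMGR_CONFIG = """\
create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts += <server>
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = long
"""

# Matches "set queue NAME attr = value" and "set server attr = value",
# with either "=" (assign) or "+=" (append) as the operator.
SET_RE = re.compile(
    r"^set\s+(?:queue\s+(?P<queue>\S+)|server)\s+"
    r"(?P<attr>\S+?)\s*(?P<op>\+?=)\s*(?P<value>.*)$"
)


def parse_qmgr(text):
    """Return (queues, server); every attribute maps to a list of values."""
    queues, server = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("create queue "):
            queues.setdefault(line.split()[2], {})
            continue
        match = SET_RE.match(line)
        if not match:
            continue  # comments, blank lines, anything not recognised
        if match.group("queue") is None:
            target = server
        else:
            target = queues.setdefault(match.group("queue"), {})
        values = target.setdefault(match.group("attr"), [])
        if match.group("op") == "=":
            values.clear()  # plain "=" replaces any earlier values
        values.append(match.group("value").strip())
    return queues, server


queues, server = parse_qmgr(QMGR_CONFIG)
default = server["default_queue"][0]
print(default, queues[default]["enabled"], queues[default]["started"])
```

For the config above, the default queue exists and is both enabled and started, which matches the observation that qrun can start the jobs.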
--
Facts aren't facts if they come from the wrong people. (Paul Krugman)