Hello all,
I am setting up a cluster that has been moved from one university to
another. I think I'm having Maui issues associated with the new
hostname. The cluster operates on SuSE Linux 10.2, using Torque 2.1.8
and Maui 3.2. I have made a number of necessary changes and performed
exhaustive searches through both the Maui and Torque forums, to no avail.
I've also looked over the Maui and Torque manuals, and am now slowly
losing sanity. Any advice is appreciated. Here is what I did:
1) updated maui.cfg SERVERHOST
2) updated torque server_name and mom_priv/config
3) updated /etc/hosts on master
4) updated network configurations
5) reconfigured pbs_server with new acl_hosts = new_hostname
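A quick way to double-check steps 1 and 2 is to compare the hostname each file actually points at. The following is only a minimal Python sketch: it takes the file contents as strings (the real paths depend on the install and are not assumed here), it assumes mom_priv/config names the server with a "$pbsserver <host>" line, and the function names are hypothetical helpers, not part of Torque or Maui.

```python
import re

def configured_server_names(server_name_text, mom_config_text, maui_cfg_text):
    """Collect the server hostname that each config file points at."""
    names = {}
    # Torque's server_name file is assumed to hold just the hostname.
    names["server_name"] = server_name_text.strip()
    # Assumption: mom_priv/config uses a "$pbsserver <host>" line.
    m = re.search(r"^\s*\$pbsserver\s+(\S+)", mom_config_text, re.MULTILINE)
    names["mom_priv/config"] = m.group(1) if m else None
    # maui.cfg names the scheduler host with "SERVERHOST <host>".
    m = re.search(r"^\s*SERVERHOST\s+(\S+)", maui_cfg_text, re.MULTILINE)
    names["maui.cfg"] = m.group(1) if m else None
    return names

def mismatches(names, expected):
    """Return the entries whose hostname is not exactly the expected one."""
    return {src: host for src, host in names.items() if host != expected}
```

The comparison is an exact string match, so a short name in one file and a fully qualified name in another will be reported as a mismatch; whether that matters on a given cluster is a separate question.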
At this point, I can ping the nodes from the master. I logged into one
node and changed its /etc/hosts so that I can now ping the master and
other nodes from this node (this is the node I am working with and
submitting jobs to, before I make any changes on other nodes). The mom
logs also indicate good communication. My problem occurs when I submit
a job: it stays in the queue. I can force it to run with sudo qrun, no
problem. At first, checkjob indicated the reason was:
job is deferred. Reason: RMFailure (job cannot be started - cannot
set hostlist)
I tried a few things that did not work:
- releasejob
- Running maui in simulation mode, which returned: ERROR: cannot
open user interface socket on port 42559
- Manually setting the hostlist in maui.cfg via 'SRHOSTLIST[27]
node2 node3 ...', keeping in mind that the default is ALL. This gives a
different checkjob error:
PE: 1.00 StartPriority: 1
job cannot run in partition DEFAULT (idle procs do not meet
requirements : 0 of 1 procs found)
idle procs: 36 feasible procs: 0
Rejection Reasons: [State : 17][ReserveTime : 9]
Detailed Node Availability Information:
node2 rejected : State
...
node9 rejected : State
node10 rejected : ReserveTime
node11 rejected : ReserveTime
node12 rejected : ReserveTime
node13 rejected : ReserveTime
node14 rejected : State
node15 rejected : State
node16 rejected : State
node17 rejected : ReserveTime
node18 rejected : State
node19 rejected : ReserveTime
node20 rejected : ReserveTime
node21 rejected : ReserveTime
node22 rejected : ReserveTime
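To see at a glance which nodes fall under each rejection reason, the per-node lines above can be grouped by reason. This is a minimal Python sketch that only assumes lines of the form "<node> rejected : <reason>" as shown in the output; tally_rejections is a hypothetical helper name, not a Maui tool.

```python
import re

# Matches lines such as "node22 rejected : ReserveTime".
REJECT_LINE = re.compile(r"^\s*(\S+)\s+rejected\s*:\s*(\S+)\s*$")

def tally_rejections(checkjob_text):
    """Map each rejection reason to the list of nodes rejected for it."""
    by_reason = {}
    for line in checkjob_text.splitlines():
        m = REJECT_LINE.match(line)
        if m:
            node, reason = m.groups()
            by_reason.setdefault(reason, []).append(node)
    return by_reason
```

Run on the full checkjob output, the list lengths should agree with the "Rejection Reasons" summary line, which makes it easier to tell the nodes blocked by state from the ones blocked by a reservation.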
I ran checknode on the node I submitted to:
checking node node22
State: Idle (in current state for 00:07:46)
Configured Resources: PROCS: 4 MEM: 7864M SWAP: 9803M DISK: 1M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
Opsys: DEFAULT Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [general]
Attributes: [Batch]
Classes: [q1 4:4][batch 4:4]
Total Time: 00:07:19 Up: 00:07:19 (100.00%) Active: 00:00:00 (0.00%)
Reservations:
User '27.0.0'(x1) -00:07:46 -> 13:55:21 (14:03:07)
Blocked Resources@-00:07:46 Procs: 4/4 (100.00%)
User '27.1.0'(x1) 13:55:21 -> 1:13:55:21 (1:00:00:00)
Blocked Resources@13:55:21 Procs: 4/4 (100.00%)
User 'normal.0.0'(x1) -00:07:46 -> 13:55:21 (14:03:07)
Blocked Resources@-00:07:46 Procs: 4/4 (100.00%)
User 'normal.1.0'(x1) 13:55:21 -> 1:13:55:21 (1:00:00:00)
Blocked Resources@13:55:21 Procs: 4/4 (100.00%)
ALERT: node is overcommitted at time -00:07:46 (P: -4)
ALERT: node is overcommitted at time 13:55:21 (P: -4)
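One possible reading of the two ALERT lines: node22 has 4 configured procs, and at each start time two reservations ('27.x.0' and 'normal.x.0') each block 4 of them, so 4 - (4 + 4) = -4. The minimal Python sketch below reproduces that arithmetic from the checknode text. It is a simplification that groups reservations by their printed start time rather than computing true overlaps, and proc_balance is a hypothetical helper name.

```python
import re
from collections import defaultdict

# "Blocked Resources@<start> Procs: <blocked>/<total>"
BLOCKED = re.compile(r"Blocked Resources@(\S+)\s+Procs:\s*(\d+)/(\d+)")
# "Configured Resources: PROCS: <n> ..."
CONFIGURED = re.compile(r"Configured Resources:.*?PROCS:\s*(\d+)")

def proc_balance(checknode_text):
    """Return {start_time: configured procs minus procs blocked at that start}."""
    m = CONFIGURED.search(checknode_text)
    if m is None:
        raise ValueError("no 'Configured Resources' line found")
    configured = int(m.group(1))
    blocked = defaultdict(int)
    for start, used, _total in BLOCKED.findall(checknode_text):
        blocked[start] += int(used)
    return {start: configured - used for start, used in blocked.items()}
```

A negative value for a start time means the reservations beginning then claim more procs than the node has, which is consistent with the "P: -4" in the alerts.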
If I get rid of SRHOSTLIST[27] node2 node3... in maui.cfg, I get the
previous "RMFailure (job cannot be started - cannot set hostlist)".
Thus, for now, I am keeping this line active in maui.cfg so that I can
at least see the job failure 'reasons'.
Can anyone tell me why Maui thinks all my nodes are overcommitted even
though I can force them to run jobs with PBS?
Thanks in advance,
Enoch
p.s. Here's some config info that may be of use:
Torque
# Create queues and set their attributes.
#
#
# Create and define queue q1
#
create queue q1
set queue q1 queue_type = Execution
set queue q1 acl_users = ***
***
set queue q1 resources_default.nodes = 1
set queue q1 resources_default.walltime = 100:00:00
set queue q1 enabled = True
set queue q1 started = True
#
# Create and define queue batch (where *** indicates I have changed the output)
#
create queue batch
set queue batch queue_type = Execution
set queue batch acl_users = ***
***
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 100:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = ***
set server default_queue = q1
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.nodes = 1
set server resources_default.walltime = 100:00:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.8
--------------------------------------------------------------------------------
MAUI config:
# Maui version 3.2.6p13 (PID: 18804)
# global policies
REJECTNEGPRIOJOBS[0] FALSE
ENABLENEGJOBPRIORITY[0] FALSE
ENABLEMULTINODEJOBS[0] TRUE
ENABLEMULTIREQJOBS[0] FALSE
BFPRIORITYPOLICY[0] [NONE]
JOBPRIOACCRUALPOLICY QUEUEPOLICY
NODELOADPOLICY ADJUSTSTATE
USEMACHINESPEED FALSE
USESYSTEMQUEUETIME TRUE
USELOCALMACHINEPRIORITY FALSE
NODEUNTRACKEDLOADFACTOR 1.2
JOBNODEMATCHPOLICY[0] EXACTNODE
JOBMAXSTARTTIME[0] INFINITY
METAMAXTASKS[0] 0
NODESETPOLICY[0] [NONE]
NODESETATTRIBUTE[0] [NONE]
NODESETLIST[0]
NODESETDELAY[0] 00:00:00
NODESETPRIORITYTYPE[0] MINLOSS
NODESETTOLERANCE[0] 0.00
BACKFILLPOLICY[0] FIRSTFIT
BACKFILLDEPTH[0] 0
BACKFILLPROCFACTOR[0] 0
BACKFILLMAXSCHEDULES[0] 10000
BACKFILLMETRIC[0] PROCS
BFCHUNKDURATION[0] 00:00:00
BFCHUNKSIZE[0] 0
PREEMPTPOLICY[0] REQUEUE
MINADMINSTIME[0] 00:00:00
RESOURCELIMITPOLICY[0]
NODEAVAILABILITYPOLICY[0] COMBINED:[DEFAULT]
NODEALLOCATIONPOLICY[0] CPULOAD
TASKDISTRIBUTIONPOLICY[0] DEFAULT
RESERVATIONPOLICY[0] CURRENTHIGHEST
RESERVATIONRETRYTIME[0] 00:00:00
RESERVATIONTHRESHOLDTYPE[0] NONE
RESERVATIONTHRESHOLDVALUE[0] 0
FSPOLICY [NONE]
FSINTERVAL 12:00:00
FSDEPTH 8
FSDECAY 1.00
# Priority Weights
SERVICEWEIGHT[0] 1
TARGETWEIGHT[0] 1
CREDWEIGHT[0] 1
ATTRWEIGHT[0] 1
FSWEIGHT[0] 1
RESWEIGHT[0] 1
USAGEWEIGHT[0] 1
QUEUETIMEWEIGHT[0] 1
XFACTORWEIGHT[0] 0
SPVIOLATIONWEIGHT[0] 0
BYPASSWEIGHT[0] 0
TARGETQUEUETIMEWEIGHT[0] 0
TARGETXFACTORWEIGHT[0] 0
USERWEIGHT[0] 0
GROUPWEIGHT[0] 0
ACCOUNTWEIGHT[0] 0
QOSWEIGHT[0] 0
CLASSWEIGHT[0] 0
FSUSERWEIGHT[0] 0
FSGROUPWEIGHT[0] 0
FSACCOUNTWEIGHT[0] 0
FSQOSWEIGHT[0] 0
FSCLASSWEIGHT[0] 0
ATTRATTRWEIGHT[0] 0
ATTRSTATEWEIGHT[0] 0
NODEWEIGHT[0] 0
PROCWEIGHT[0] 0
MEMWEIGHT[0] 0
SWAPWEIGHT[0] 0
DISKWEIGHT[0] 0
PSWEIGHT[0] 0
PEWEIGHT[0] 0
WALLTIMEWEIGHT[0] 0
UPROCWEIGHT[0] 0
UJOBWEIGHT[0] 0
CONSUMEDWEIGHT[0] 0
REMAININGWEIGHT[0] 0
PERCENTWEIGHT[0] 0
XFMINWCLIMIT[0] 00:02:00
# partition DEFAULT policies
REJECTNEGPRIOJOBS[1] FALSE
ENABLENEGJOBPRIORITY[1] FALSE
ENABLEMULTINODEJOBS[1] TRUE
ENABLEMULTIREQJOBS[1] FALSE
BFPRIORITYPOLICY[1] [NONE]
JOBPRIOACCRUALPOLICY QUEUEPOLICY
NODELOADPOLICY ADJUSTSTATE
JOBNODEMATCHPOLICY[1]
JOBMAXSTARTTIME[1] INFINITY
METAMAXTASKS[1] 0
NODESETPOLICY[1] [NONE]
NODESETATTRIBUTE[1] [NONE]
NODESETLIST[1]
NODESETDELAY[1] 00:00:00
NODESETPRIORITYTYPE[1] MINLOSS
NODESETTOLERANCE[1] 0.00
# Priority Weights
XFMINWCLIMIT[1] 00:00:00
SRTASKCOUNT[0] 0
SRTPN[0] 0
SRRESOURCES[0] PROCS=-1;MEM=0;DISK=0;SWAP=0
SRDEPTH[0] 2
SRSTARTTIME[0] 00:00:00
SRENDTIME[0] 00:00:00
SRWSTARTTIME[0] 00:00:00
SRWENDTIME[0] 00:00:00
SRDAYS[0] ALL
SRHOSTLIST[0] node2 node3 node4 node5 node6 node7 node8 node9 node10 node11 node12 node13 node14 node15 node16 node17 node18 node19 node20 node21 node22 node23 node24 node25 node26 node27
SRCHARGEACCOUNT[0]
SRCFG[27] HOSTLIST=node2 node3 node4 node5 node6 node7 node8 node9 node10 node11 node12 node13 node14 node15 node16 node17 node18 node19 node20 node21 node22 node23 node24 node25 node26 node27
RMAUTHTYPE[0] CHECKSUM
CLASSCFG[[NONE]] DEFAULT.FEATURES=[NONE]
CLASSCFG[[ALL]] DEFAULT.FEATURES=[NONE]
CLASSCFG[q1] DEFAULT.FEATURES=[NONE]
CLASSCFG[batch] DEFAULT.FEATURES=[NONE]
****skip node specific info****
# SERVER MODULES: MX
SERVERMODE NORMAL
SERVERNAME
SERVERHOST ***
SERVERPORT 42559
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGFILEROLLDEPTH 1
LOGLEVEL 4
LOGFACILITY fALL
SERVERHOMEDIR /usr/local/maui/
TOOLSDIR /usr/local/maui/tools/
LOGDIR /usr/local/maui/log/
STATDIR /usr/local/maui/stats/
LOCKFILE /usr/local/maui/maui.pid
SERVERCONFIGFILE /usr/local/maui/maui.cfg
CHECKPOINTFILE /usr/local/maui/maui.ck
CHECKPOINTINTERVAL 00:05:00
CHECKPOINTEXPIRATIONTIME 3:11:20:00
TRAPJOB
TRAPNODE
TRAPFUNCTION
RESDEPTH 24
RMPOLLINTERVAL 00:00:30
NODEACCESSPOLICY SHARED
ALLOCLOCALITYPOLICY [NONE]
SIMTIMEPOLICY [NONE]
ADMIN1 admin1 ***
ADMINHOSTS ALL
NODEPOLLFREQUENCY 0
DISPLAYFLAGS
DEFAULTDOMAIN
DEFAULTCLASSLIST [DEFAULT:1]
FEATURENODETYPEHEADER
FEATUREPROCSPEEDHEADER
FEATUREPARTITIONHEADER
DEFERTIME 1:00:00
DEFERCOUNT 24
DEFERSTARTCOUNT 1
JOBPURGETIME 0
NODEPURGETIME 2140000000
APIFAILURETHRESHHOLD 6
NODESYNCTIME 600
JOBSYNCTIME 600
JOBMAXOVERRUN 00:10:00
NODEMAXLOAD 0.0
PLOTMINTIME 120
PLOTMAXTIME 245760
PLOTTIMESCALE 11
PLOTMINPROC 1
PLOTMAXPROC 512
PLOTPROCSCALE 9
SCHEDCFG[] MODE=NORMAL SERVER=***
# RM MODULES: PBS SSS WIKI NATIVE
RMCFG[***] AUTHTYPE=CHECKSUM EPORT=15004 TIMEOUT=00:00:09 TYPE=PBS
SIMWORKLOADTRACEFILE workload
SIMRESOURCETRACEFILE resource
SIMAUTOSHUTDOWN OFF
SIMSTARTTIME 0
SIMSCALEJOBRUNTIME FALSE
SIMFLAGS
SIMJOBSUBMISSIONPOLICY CONSTANTJOBDEPTH
SIMINITIALQUEUEDEPTH 16
SIMWCACCURACY 0.00
SIMWCACCURACYCHANGE 0.00
SIMNODECOUNT 0
SIMNODECONFIGURATION NORMAL
SIMWCSCALINGPERCENT 100
SIMCOMRATE 0.10
SIMCOMTYPE ROUNDROBIN
COMINTRAFRAMECOST 0.30
COMINTERFRAMECOST 0.30
SIMSTOPITERATION -1
SIMEXITITERATION -1