I manage a moderately sized cluster running SLC5.5, which uses Maui and PBS. After upgrading the head node to SLC5.5 and upgrading Torque, the batch system is behaving oddly.
Our 58 worker nodes are all identical dual-socket quad-core boxes, i.e. they have 8 cores each. For some reason, jobs are only scheduled until 7 of the 8 cores on a node are in use. Checking one of the queued jobs, I find that it is not being scheduled because Maui claims that there are no free CPUs:

[root@######### server_priv]# checkjob -v 24933

checking job 24933 (RM job '24933.###########')

State: Idle
Creds:  user:szczypka  group:######  class:long  qos:DEFAULT
WallTime: 00:00:00 of 83:08:00:00
SubmitTime: Thu Feb 10 16:39:36
  (Time Queued  Total: 00:04:07  Eligible: 00:04:07)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1  MEM: 1000M
NodeAccess: SHARED
NodeCount: 0

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

PE:  1.00  StartPriority:  1
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 464  feasible procs:   0

Rejection Reasons: [CPU          :   58]

Detailed Node Availability Information:

n01                      rejected : CPU
n02                      rejected : CPU
...
n57                      rejected : CPU
n58                      rejected : CPU

Yet this is clearly not the case: every node still has a free core. Removing all the default resource requirements has no effect. Interestingly, if I flood the cluster with jobs requiring 2, 4 or 8 processors (e.g. qsub -l nodes=1:ppn=8), then the jobs fill the cluster entirely.
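For anyone who wants to cross-check the numbers, here is a minimal Python sketch that tallies the rejection reasons from checkjob output like the above and compares the reported idle procs with the expected core count (58 nodes x 8 cores). The function name summarize_checkjob and the abbreviated SAMPLE text are made up for illustration; they are not part of Maui or of our setup.

```python
import re
from collections import Counter

# Lines copied from the checkjob output in this post (node list abbreviated).
SAMPLE = """\
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 464  feasible procs:   0
n01                      rejected : CPU
n02                      rejected : CPU
n57                      rejected : CPU
n58                      rejected : CPU
"""


def summarize_checkjob(text):
    """Return (idle_procs, feasible_procs, Counter of per-node rejection reasons)."""
    idle = feasible = None
    reasons = Counter()
    for line in text.splitlines():
        # Summary line: "idle procs: N  feasible procs: M"
        m = re.search(r"idle procs:\s*(\d+)\s+feasible procs:\s*(\d+)", line)
        if m:
            idle, feasible = int(m.group(1)), int(m.group(2))
            continue
        # Per-node line: "<node>   rejected : <reason>"
        m = re.match(r"\s*(\S+)\s+rejected\s*:\s*(\S+)", line)
        if m:
            reasons[m.group(2)] += 1
    return idle, feasible, reasons


if __name__ == "__main__":
    idle, feasible, reasons = summarize_checkjob(SAMPLE)
    print("idle:", idle, "feasible:", feasible, "reasons:", dict(reasons))
    # 58 nodes with 8 cores each is the full core count of the cluster.
    print("idle procs equal total cores:", idle == 58 * 8)
```

In other words, the idle procs figure that Maui reports (464) equals the total number of cores in the cluster, yet none of them is considered feasible for a single-processor job.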
Below is our maui.cfg:

"""
SERVERHOST ##########
ADMIN1 root
ADMIN3 ALL
RMCFG[############] TYPE=PBS
AMCFG[bank] TYPE=NONE
RMPOLLINTERVAL 00:00:10
SERVERPORT 42559
SERVERMODE NORMAL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
FSPOLICY DEDICATEDPES
FSDEPTH 30
FSINTERVAL 2:00:00
FSDECAY 0.80
FSWEIGHT 500
FSUSERWEIGHT 10
JOBPRIOACCRUALPOLICY ALWAYS
XFACTORWEIGHT 3
XFWEIGHT 7
XFCAP 1000000
XFMINWCLIMIT 0:01:00
BACKFILLPOLICY BESTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF='-1 * JOBCOUNT'
USERWEIGHT 1
USERCFG[DEFAULT] PRIORITY=10000
USERCFG[DEFAULT] FSTARGET=10.0
CLASSCFG[data] PRIORITY=100000
CLASSCFG[align] PRIORITY=100000
CLASSCFG[data5] PRIORITY=100000
CLASSCFG[dirac] MAXJOB=5
"""

Does anyone have any advice or thoughts on what might be causing this?

Thanks,
Paul.
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
