Hi all. We've been using Torque+Maui on our cluster for some time now. It's a small cluster, composed of 8 quad-core nodes (only 6 of them currently online) plus a master machine.
The way the queue should work is:

1 - Every user has a maximum of 8 processors/cores in use at the same time, at *any* moment;
2 - No (sub-)group of users (there are three groups) should be able to use more than 16 processors/cores at the same time, at *any* moment.

I'm quite sure I tested the configuration below to confirm these usage policies. Unfortunately, for reasons I can't explain so far, the cluster is only now becoming heavily used, more than a month after it was officially started. And now a really strange behaviour has shown up, which seems to be related to the way I configured Maui:

1 - There are 6 jobs from users of the same group running at the same time, using 24 cores. I guess that if there were more nodes available, the two queued jobs from that same group would also start to run.
2 - There is one user with three jobs alone, using 12 cores. :(

How can I correct this? Because of "internal policies", there is no problem in having a spare node sitting idle with no processes (we intend to deal with that by implementing some sort of "wake on LAN" procedure), but under no circumstances should a group or a user be able to go over these established limits. :(

Here follows my maui.cfg. I've only removed the server name for safety reasons:

*******************
# maui.cfg 3.2.6p20

SERVERHOST            server
# primary admin must be first in list
ADMIN1                root

# Resource Manager Definition

RMCFG[server]         TYPE=PBS

# Allocation Manager Definition

AMCFG[bank]           TYPE=NONE

RMPOLLINTERVAL        00:00:30

SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY             PSDEDICATED
#FSDEPTH              7
#FSINTERVAL           86400
#FSDECAY              0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

CLASSCFG[cluster]     MAXPROC[GROUP]=16 MAXPROC[USER]=8
CLASSCFG[qm]          MAXPROC[USER]=8

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT  #NONE
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE  #CPULOAD or FIRSTAVAILABLE ???!!!

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100 FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR
********************
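From my reading of the throttling policies page referenced above, I've been wondering whether I should be using per-credential limits instead of (or in addition to) the per-class CLASSCFG ones. Something along these lines is what I have in mind; this is just an untested sketch based on the documentation, with DEFAULT meant to apply to every user and every group:

*******************
# untested sketch: per-credential throttling instead of per-class limits
# 8 cores per user, across all classes, at any moment
USERCFG[DEFAULT]   MAXPROC=8
# 16 cores per group, across all classes, at any moment
GROUPCFG[DEFAULT]  MAXPROC=16
*******************

Would that be the right direction, or should the CLASSCFG lines above already be enforcing this?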
Here is the "showq" output:

***********************
ACTIVE JOBS--------------------
JOBNAME    USERNAME    STATE    PROC   REMAINING            STARTTIME

311        gullit      Running     4   94:05:48:49  Fri Sep 4 21:27:41
313        msegala     Running     4   96:08:39:28  Mon Sep 7 00:18:20
314        msegala     Running     4   96:20:13:42  Mon Sep 7 11:52:34
318        ricksander  Running     4   98:19:45:23  Wed Sep 9 11:24:15
320        msegala     Running     4   98:23:29:31  Wed Sep 9 15:08:23
321        william     Running     4   99:08:03:39  Wed Sep 9 23:42:31

     6 Active Jobs      24 of   24 Processors Active (100.00%)
                         6 of    6 Nodes Active      (100.00%)

IDLE JOBS----------------------
JOBNAME    USERNAME    STATE    PROC     WCLIMIT            QUEUETIME

322        gullit      Idle        4   99:23:59:59  Thu Sep 10 08:51:42
323        gullit      Idle        4   99:23:59:59  Thu Sep 10 11:08:50

2 Idle Jobs

BLOCKED JOBS----------------
JOBNAME    USERNAME    STATE    PROC     WCLIMIT            QUEUETIME


Total Jobs: 8   Active Jobs: 6   Idle Jobs: 2   Blocked Jobs: 0
***********************

And here is the "qstat -q" output:

***********************
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
qm                 --      --       --      --    0   0 --   E R
cluster            --      --       --      --    6   2 --   E R
                                                ----- -----
                                                    6     2
***********************

Any clues here? By the way, is there any way to enforce any corrections I make immediately, which would mean automatically placing the most recently started jobs above back in a "waiting" state?
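In case it makes that last question more concrete, what I had in mind is something like the following, typed by hand with the job IDs taken from the showq listing above (assuming those jobs were submitted as rerunnable, otherwise I suppose qrerun won't take them):

***********************
# requeue the two most recently started jobs (msegala's third and the
# group's sixth), which should bring that user back to 8 cores and the
# group back to 16
qrerun 320 321

# keep them queued until the limits are sorted out
qhold 320 321
***********************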
Thanks a lot in advance for any help with this matter!

Sincerely yours,
Jones

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
