Greetings. I have posted this to both the Torque and Maui user lists, as I am unsure whether the issue is in Maui or in Torque (although we had this same problem before we ran Maui).
I am configuring a cluster for engineering simulation use at my office. We have two clusters: one with 12 nodes and 16 processors per node, and a 5-node cluster with 16 processors per node (except for a big-memory machine with 32 processors). I am only working on the 5-node cluster at this time, but the behavior I am dealing with appears on both clusters.

When the procs syntax is used, the system defaults to 1 process, even though procs is > 1. All nodes show free when issuing qnodes or pbsnodes -a, and they list the appropriate number of CPUs defined in the nodes file.

I have a simple test script:

#!/bin/bash
#PBS -S /bin/bash
#PBS -l nodes=2:ppn=8
#PBS -j oe
cat $PBS_NODEFILE

This script prints out:

pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
pegasus.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet
amdfr1.am1.mnet

which is expected. When I change the PBS resource list to:

#PBS -l procs=32

I get the following:

pegasus.am1.mnet

The machine file created in /var/spool/torque/aux has just one entry for one process, even though I requested 32.

We have a piece of simulation software that REQUIRES the use of the "-l procs=n" syntax to function on the cluster. (ANSYS does not plan to permit changes to this until Release 16 in 2015.) We are trying to use our cluster with ANSYS RSM with CFX and Fluent. We are running Torque 4.2.6.1 and Maui 3.3.1.

My queue and server attributes are defined as follows:

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = titan1.am1.mnet
set server managers = [email protected]
set server managers += [email protected]
set server operators = [email protected]
set server operators += [email protected]
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = titan1.am1.mnet
set server next_job_number = 8
set server moab_array_compatible = True
set server nppcu = 1

My Torque nodes file is:

titan1.am1.mnet  np=16 RAM64GB
titan2.am1.mnet  np=16 RAM64GB
amdfl1.am1.mnet  np=16 RAM64GB
amdfr1.am1.mnet  np=16 RAM64GB
pegasus.am1.mnet np=32 RAM128GB

Our maui.cfg file is:

# maui.cfg 3.3.1

SERVERHOST            titan1.am1.mnet
# primary admin must be first in list
ADMIN1                root kevin
ADMIN3                ALL

# Resource Manager Definition

RMCFG[TITAN1.AM1.MNET] TYPE=PBS

# Allocation Manager Definition

AMCFG[bank]  TYPE=NONE

# full parameter docs at http://supercluster.org/mauidocs/a.fparameters.html
# use the 'schedctl -l' command to display current configuration

RMPOLLINTERVAL        00:00:30

SERVERPORT            42559
SERVERMODE            NORMAL

# Admin: http://supercluster.org/mauidocs/a.esecurity.html

LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

# Job Priority: http://supercluster.org/mauidocs/5.1jobprioritization.html

QUEUETIMEWEIGHT       1

# FairShare: http://supercluster.org/mauidocs/6.3fairshare.html

#FSPOLICY              PSDEDICATED
#FSDEPTH               7
#FSINTERVAL            86400
#FSDECAY               0.80

# Throttling Policies: http://supercluster.org/mauidocs/6.2throttlingpolicies.html

# NONE SPECIFIED

# Backfill: http://supercluster.org/mauidocs/8.2backfill.html

BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

# Node Allocation: http://supercluster.org/mauidocs/5.2nodeallocation.html

NODEALLOCATIONPOLICY  MINRESOURCE

# Kevin's Modifications:
JOBNODEMATCHPOLICY    EXACTNODE

# QOS: http://supercluster.org/mauidocs/7.3qos.html

# QOSCFG[hi]  PRIORITY=100 XFTARGET=100 FLAGS=PREEMPTOR:IGNMAXJOB
# QOSCFG[low] PRIORITY=-1000 FLAGS=PREEMPTEE

# Standing Reservations: http://supercluster.org/mauidocs/7.1.3standingreservations.html

# SRSTARTTIME[test] 8:00:00
# SRENDTIME[test]   17:00:00
# SRDAYS[test]      MON TUE WED THU FRI
# SRTASKCOUNT[test] 20
# SRMAXTIME[test]   0:30:00

# Creds: http://supercluster.org/mauidocs/6.1fairnessoverview.html

# USERCFG[DEFAULT]      FSTARGET=25.0
# USERCFG[john]         PRIORITY=100  FSTARGET=10.0-
# GROUPCFG[staff]       PRIORITY=1000 QLIST=hi:low QDEF=hi
# CLASSCFG[batch]       FLAGS=PREEMPTEE
# CLASSCFG[interactive] FLAGS=PREEMPTOR

Our MOM config file is:

$pbsserver  10.0.0.10   # IP address of titan1.am1.mnet
$clienthost 10.0.0.10   # IP address of management node
$usecp *:/home/kevin /home/kevin
$usecp *:/home /home
$usecp *:/root /root
$usecp *:/home/mpi /home/mpi
$tmpdir /home/mpi/tmp

I am finding it difficult to identify the configuration issue. I thought this thread would help: http://comments.gmane.org/gmane.comp.clustering.maui.user/2859 — but their examples show the machine file being generated correctly, and they are battling memory allocations instead. I can't seem to get that far yet.

Any thoughts?

--
Kevin Sutherland
Simulations Specialist
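P.S. The check I have been running inside the job script boils down to comparing the length of the machine file against the requested process count. A minimal sketch (the hardcoded 32 and the check_alloc name are just examples from my testing, not anything Torque provides):

```shell
#!/bin/bash
# check_alloc: compare the number of slots actually granted (lines in
# the machine file) against the count requested with -l procs=N.
# Returns nonzero on a mismatch, so the job can fail loudly.
check_alloc() {
    local requested="$1" nodefile="$2"
    local granted
    granted=$(wc -l < "$nodefile")
    echo "requested=$requested granted=$granted"
    [ "$granted" -eq "$requested" ]
}

# In a real job, Torque sets PBS_NODEFILE; a script submitted with
# "-l procs=32" would call:
#   check_alloc 32 "$PBS_NODEFILE" || exit 1
```

With the current configuration this reports granted=1, since the machine file in /var/spool/torque/aux only ever contains a single entry.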
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
