Found this post online: http://www.supercluster.org/pipermail/mauiusers/2010-February/004116.html
I also have JOBNODEMATCHPOLICY EXACTNODE and NODEACCESSPOLICY SINGLEJOB set in
the configuration. Could this bug still be present in Maui?

I tested with a smaller cluster; let me explain the scenario again. This time I
have a 6-node cluster running Torque-3.0.3 and Maui. Additional configuration in
my Maui configuration file:

----------
BACKFILLPOLICY       FIRSTFIT
RESERVATIONPOLICY    CURRENTHIGHEST
ENABLEMULTIREQJOBS   TRUE
JOBNODEMATCHPOLICY   EXACTNODE
NODEACCESSPOLICY     SINGLEJOB
----------

Now I submit a 2-node job with the following resource requirement:

----------
#PBS -l nodes=2,walltime=0:10:00
----------

This job starts on node1/0 + node2/0.

Next, I submit a 4-node job with the following resource requirement:

----------
#PBS -l nodes=1:ppn=2+3,walltime=0:05:00
----------

This job also starts, but with the following resources:

*node3/0 + node3/1* + *node4/0 + node4/1* + node5/0

I would expect this job to use the resources as follows:

node3/0 + node3/1 + node4/0 + node5/0 + node6/0

Instead, it did not use node6 at all: it put 2 procs each on node3 and node4 and
another proc on node5, while node6 remained idle. Is this a bug, or is some
other configuration / setting required?

Thanks,
Kunal

On Fri, Jun 1, 2012 at 3:57 PM, Kunal Rao <[email protected]> wrote:

> I removed NODEALLOCATIONPOLICY and tried again. This time the job started,
> but the node allocation was not as expected.
>
> The job needs 1 node with 2 procs and 3 nodes with 1 proc each. The
> allocation was done on only 3 nodes: 2 with 2 procs and 1 with 1 proc. Not
> sure if this is a bug or a conflict in the configuration.
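[To make the expected layout above concrete, here is a minimal sketch, in Python with hypothetical helper names (this is not Maui source), of how a scheduler honoring JOBNODEMATCHPOLICY EXACTNODE and NODEACCESSPOLICY SINGLEJOB would be expected to map the multi-req spec nodes=1:ppn=2+3 onto the four free hosts:]

```python
# Hypothetical sketch of the EXACTNODE semantics described above (not Maui code):
# every requested "node" in the spec should land on its own physical host,
# since SINGLEJOB forbids sharing a host between task groups of different jobs.

def parse_nodes_spec(spec):
    """Parse a Torque nodes= spec like '1:ppn=2+3' into (node_count, ppn) pairs."""
    reqs = []
    for part in spec.split("+"):
        if ":" in part:
            count, ppn = part.split(":")
            reqs.append((int(count), int(ppn.split("=")[1])))
        else:
            reqs.append((int(part), 1))   # bare count means 1 proc per node
    return reqs

def allocate_exactnode(reqs, free_nodes):
    """Give each requested node a distinct physical host from the free pool."""
    alloc, pool = [], list(free_nodes)
    for count, ppn in reqs:
        for _ in range(count):
            host = pool.pop(0)            # fresh host for every requested node
            alloc.append((host, ppn))
    return alloc

reqs = parse_nodes_spec("1:ppn=2+3")      # -> [(1, 2), (3, 1)]
print(allocate_exactnode(reqs, ["node3", "node4", "node5", "node6"]))
# -> [('node3', 2), ('node4', 1), ('node5', 1), ('node6', 1)]
```

[Under this reading the job would span node3 through node6, which matches the allocation Kunal expected; the observed packing of two procs onto node4 instead suggests the second requirement was not being spread across distinct hosts.]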
>
> My current additional configurations are:
>
> BACKFILLPOLICY       FIRSTFIT
> RESERVATIONPOLICY    CURRENTHIGHEST
> ENABLEMULTIREQJOBS   TRUE
> JOBNODEMATCHPOLICY   EXACTNODE
> NODEACCESSPOLICY     SINGLEJOB
>
> I also tried this, but the result was the same:
>
> BACKFILLPOLICY       FIRSTFIT
> RESERVATIONPOLICY    CURRENTHIGHEST
> ENABLEMULTIREQJOBS   TRUE
> NODEALLOCATIONPOLICY PRIORITY
> NODECFG[DEFAULT]     PRIORITYF='APROCS'
> JOBNODEMATCHPOLICY   EXACTNODE
> NODEACCESSPOLICY     SINGLEJOB
>
> Any suggestions?
>
> Thanks,
> Kunal
>
> On Thu, May 31, 2012 at 10:26 PM, Kunal Rao <[email protected]> wrote:
>
>> I need NODEACCESSPOLICY; maybe I'll remove NODEALLOCATIONPOLICY and check
>> tomorrow.
>>
>> Thanks,
>> Kunal
>>
>> On Thu, May 31, 2012 at 10:23 PM, Ju JiaJia <[email protected]> wrote:
>>
>>> Everything seems OK. I think you could try deleting the additional
>>> configuration in maui.cfg, like NODEALLOCATIONPOLICY and NODEACCESSPOLICY,
>>> or use the defaults or other options.
>>>
>>> On Fri, Jun 1, 2012 at 9:59 AM, Kunal Rao <[email protected]> wrote:
>>>
>>>> Each node has 16 cores. The TORQUE_HOME/server_priv/nodes file has, for
>>>> each of the 10 nodes:
>>>>
>>>> <node_name> np=16 gpus=1
>>>>
>>>> Thanks,
>>>> Kunal
>>>>
>>>> On Thu, May 31, 2012 at 9:54 PM, Ju JiaJia <[email protected]> wrote:
>>>>
>>>>> How many cores are on each of the 10 nodes? I ask because you are trying
>>>>> to allocate 2 processors on one node. And how did you
>>>>> configure TORQUE_HOME/server_priv/nodes?
>>>>>
>>>>> On Fri, Jun 1, 2012 at 8:54 AM, Kunal Rao <[email protected]> wrote:
>>>>>
>>>>>> Queue / Server configuration:
>>>>>>
>>>>>> ---------------
>>>>>>
>>>>>> qmgr -c 'p s'
>>>>>> #
>>>>>> # Create queues and set their attributes.
>>>>>> #
>>>>>> #
>>>>>> # Create and define queue batch
>>>>>> #
>>>>>> create queue batch
>>>>>> set queue batch queue_type = Execution
>>>>>> set queue batch resources_default.nodes = 1
>>>>>> set queue batch resources_default.walltime = 01:00:00
>>>>>> set queue batch enabled = True
>>>>>> set queue batch started = True
>>>>>> #
>>>>>> # Set server attributes.
>>>>>> #
>>>>>> set server scheduling = True
>>>>>> set server acl_hosts = fire16
>>>>>> set server acl_roots = [email protected]
>>>>>> set server managers = [email protected]
>>>>>> set server operators = [email protected]
>>>>>> set server default_queue = batch
>>>>>> set server log_events = 511
>>>>>> set server mail_from = adm
>>>>>> set server scheduler_iteration = 20
>>>>>> set server node_check_rate = 150
>>>>>> set server tcp_timeout = 6
>>>>>> set server mom_job_sync = True
>>>>>> set server keep_completed = 300
>>>>>> set server allow_node_submit = True
>>>>>> set server next_job_number = 6331
>>>>>>
>>>>>> ---------------
>>>>>>
>>>>>> Job resource requirement:
>>>>>>
>>>>>> ---------
>>>>>>
>>>>>> #PBS -l nodes=1:ppn=2+3,walltime=0:05:00
>>>>>>
>>>>>> ---------
>>>>>>
>>>>>> "pbsnodes -a" shows all 10 nodes in the "free" state, so they are all
>>>>>> accessible.
>>>>>>
>>>>>> Thanks,
>>>>>> Kunal
>>>>>>
>>>>>> On 5/31/12, Ju JiaJia <[email protected]> wrote:
>>>>>> > Please give your queue/server configuration and your job's resource
>>>>>> > needs, cpu/memory etc. Are all 10 nodes accessible? You can use
>>>>>> > pbsnodes to check this.
>>>>>> >
>>>>>> > On Thu, May 31, 2012 at 10:53 PM, Kunal Rao <[email protected]>
>>>>>> > wrote:
>>>>>> >
>>>>>> >> Hello,
>>>>>> >>
>>>>>> >> Please see the message below. I had posted it on the maui users
>>>>>> >> mailing list but did not get any response, so I thought of posting it
>>>>>> >> here on the torque users mailing list (in case someone would know).
>>>>>> >> Kindly let me know if you have
>>>>>> >> any comments / ideas / suggestions.
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Kunal
>>>>>> >>
>>>>>> >> ---------- Forwarded message ----------
>>>>>> >> From: Kunal Rao <[email protected]>
>>>>>> >> Date: Wed, May 23, 2012 at 2:30 PM
>>>>>> >> Subject: Re: Multi-req job not starting
>>>>>> >> To: [email protected]
>>>>>> >>
>>>>>> >> There was a similar post earlier:
>>>>>> >> http://www.clusterresources.com/pipermail/mauiusers/2009-July/003930.html
>>>>>> >>
>>>>>> >> But I did not find any response to it. Can anyone please provide some
>>>>>> >> ideas / suggestions on this issue?
>>>>>> >>
>>>>>> >> Thanks,
>>>>>> >> Kunal
>>>>>> >>
>>>>>> >> On Wed, May 23, 2012 at 2:26 PM, Kunal Rao <[email protected]> wrote:
>>>>>> >>
>>>>>> >>> Hello,
>>>>>> >>>
>>>>>> >>> I have a 10-node cluster and 3 jobs: one that needs 2 nodes (with
>>>>>> >>> 1 task per node), another that needs 4 nodes (with 1 task per node),
>>>>>> >>> and a third that needs 4 nodes (with 2 tasks on 1 node and 1 task
>>>>>> >>> each on the other 3 nodes).
>>>>>> >>>
>>>>>> >>> Additional configuration in maui.cfg is:
>>>>>> >>>
>>>>>> >>> BACKFILLPOLICY       FIRSTFIT
>>>>>> >>> RESERVATIONPOLICY    CURRENTHIGHEST
>>>>>> >>> ENABLEMULTIREQJOBS   TRUE
>>>>>> >>> NODEALLOCATIONPOLICY MINRESOURCE
>>>>>> >>> NODEACCESSPOLICY     SINGLEJOB
>>>>>> >>> JOBNODEMATCHPOLICY   EXACTNODE
>>>>>> >>>
>>>>>> >>> I am observing that while the first 2 jobs are running, the third
>>>>>> >>> one does not start (even though 4 nodes are available) until one of
>>>>>> >>> those jobs completes.
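[A quick back-of-the-envelope check of the 10-node scenario described above; this is plain arithmetic under the stated SINGLEJOB assumption (whole-node exclusivity), not Maui's internal accounting:]

```python
# Sanity-check the node arithmetic for the scenario described above.
# Under NODEACCESSPOLICY SINGLEJOB each running job holds its nodes exclusively.

TOTAL_NODES = 10
running = {"job1": 2, "job2": 4}              # nodes held by the two running jobs
free = TOTAL_NODES - sum(running.values())    # nodes left idle

# Job 3 is a multi-req job: 1 node with 2 procs + 3 nodes with 1 proc each,
# so it needs 4 distinct nodes in total.
job3_nodes_needed = 1 + 3

print(free, job3_nodes_needed, free >= job3_nodes_needed)
# -> 4 4 True: by this count the third job should fit on the idle nodes
```

[Since the free-node count exactly covers the third job's requirements, the job sitting idle points at the scheduler's handling of the multi-req reservation rather than at a genuine resource shortage.]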
>>>>>> >>> With checkjob -v <job_id> it shows the following output:
>>>>>> >>>
>>>>>> >>> ------------------
>>>>>> >>>
>>>>>> >>> checking job 5791 (RM job '5791.fire16.csa.local')
>>>>>> >>>
>>>>>> >>> State: Idle
>>>>>> >>> Creds:  user:kunal  group:kunal  class:batch  qos:DEFAULT
>>>>>> >>> WallTime: 00:00:00 of 00:04:51
>>>>>> >>> SubmitTime: Wed May 23 11:52:04
>>>>>> >>>   (Time Queued  Total: 00:48:52  Eligible: 00:48:52)
>>>>>> >>>
>>>>>> >>> StartDate: 00:00:01  Wed May 23 12:40:57
>>>>>> >>> Total Tasks: 2
>>>>>> >>>
>>>>>> >>> Req[0]  TaskCount: 2  Partition: ALL
>>>>>> >>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>>> >>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>>> >>> Exec: ''  ExecSize: 0  ImageSize: 0
>>>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>>>> >>> NodeAccess: SINGLEJOB
>>>>>> >>> TasksPerNode: 2  NodeCount: 1
>>>>>> >>>
>>>>>> >>> Req[1]  TaskCount: 3  Partition: ALL
>>>>>> >>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>>>>> >>> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
>>>>>> >>> Exec: ''  ExecSize: 0  ImageSize: 0
>>>>>> >>> Dedicated Resources Per Task: PROCS: 1
>>>>>> >>> NodeAccess: SINGLEJOB
>>>>>> >>> NodeCount: 3
>>>>>> >>>
>>>>>> >>> IWD: [NONE]  Executable: [NONE]
>>>>>> >>> Bypass: 5  StartCount: 0
>>>>>> >>> PartitionMask: [ALL]
>>>>>> >>> Flags: RESTARTABLE
>>>>>> >>>
>>>>>> >>> Reservation '5791' (00:00:01 -> 00:04:52  Duration: 00:04:51)
>>>>>> >>> PE: 5.00  StartPriority: 48
>>>>>> >>> cannot select job 5791 for partition DEFAULT (startdate in '00:00:01')
>>>>>> >>>
>>>>>> >>> ------------
>>>>>> >>>
>>>>>> >>> What could be the reason this job is not starting? How do I resolve
>>>>>> >>> this?
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> Kunal
>>>>>> >>>
>>>>>> >> _______________________________________________
>>>>>> >> torqueusers mailing list
>>>>>> >> [email protected]
>>>>>> >> http://www.supercluster.org/mailman/listinfo/torqueusers
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
