Hi, I've been trying to set up Torque+Maui without success.
The cluster here is composed of 70 nodes connected with InfiniBand (IB) and 5 nodes without IB but with better processors. Each node has 2 quad-core processors with hyperthreading enabled, for a total of 16 logical cores per node. I want to set up a queue to run on the non-IB nodes, which I call the supernodes. There are 80 cores on these.

Here is Torque's queue config:

> set queue supernodes queue_type = Execution
> set queue supernodes resources_available.nodes = 5
> set queue supernodes resources_default.neednodes = supernode
> set queue supernodes resources_default.walltime = 00:10:00
> set queue supernodes enabled = True
> set queue supernodes started = True
> set server scheduling = True
> set server acl_hosts = mydomain.com
> set server operators = [email protected]
> set server operators += [email protected]
> set server default_queue = supernodes
> set server log_events = 511
> set server mail_from = adm
> set server resources_available.ncpus = 1200
> set server resources_available.nodect = 75
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server auto_node_np = False
> set server next_job_number = 90842

The nodes file "/var/spool/torque/server_priv/nodes" contains (for the supernodes):

> node101 np=16 supernode
> node102 np=16 supernode
> node103 np=16 supernode
> node104 np=16 supernode
> node105 np=16 supernode

I can submit and run a job only if the number of CPUs requested is less than or equal to the number of cores of a single node, e.g. -l nodes=1:ppn=5. But if I request 5 cores on each of 2 different nodes (-l nodes=2:ppn=5), the job stays in a "queued" state indefinitely. Maui's log file does not seem to report anything suspicious, even at log level 7.
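On paper the two-node request should fit within what the nodes file declares. Here is a minimal Python sketch of that sanity check, using the nodes-file lines quoted above. The helper names `parse_nodes` and `fits` are hypothetical (they are not part of Torque or Maui), and the sketch only compares static capacity; it does not model what the scheduler actually does.

```python
# Sketch only: parse_nodes and fits are hypothetical helpers,
# not Torque or Maui APIs. They check declared capacity, nothing more.

NODES_FILE = """\
node101 np=16 supernode
node102 np=16 supernode
node103 np=16 supernode
node104 np=16 supernode
node105 np=16 supernode
"""


def parse_nodes(text):
    """Return {hostname: (np, set_of_properties)} from nodes-file lines."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        name, np, props = fields[0], 1, set()  # assumption: np is 1 if absent
        for field in fields[1:]:
            if field.startswith("np="):
                np = int(field[3:])
            else:
                # Simplification: every other field is treated as a property.
                props.add(field)
        nodes[name] = (np, props)
    return nodes


def fits(nodes, prop, want_nodes, ppn):
    """True if at least want_nodes nodes carry prop and declare np >= ppn."""
    candidates = [
        name for name, (np, props) in nodes.items()
        if prop in props and np >= ppn
    ]
    return len(candidates) >= want_nodes


nodes = parse_nodes(NODES_FILE)
total_cores = sum(np for np, props in nodes.values() if "supernode" in props)

print(total_cores)                     # declared supernode cores
print(fits(nodes, "supernode", 1, 5))  # -l nodes=1:ppn=5
print(fits(nodes, "supernode", 2, 5))  # -l nodes=2:ppn=5
```

Both requests pass this static check, so the declared node count and np values alone do not explain why the two-node job never starts.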
checkjob reports the nodes the job is scheduled to run on (node104 and node105):

> $ checkjob 90843
> checking job 90843
>
> State: Running
> Creds: user:me group:me class:supernodes qos:DEFAULT
> WallTime: 00:00:00 of 00:00:30
> SubmitTime: Tue Nov 30 18:04:11
>   (Time Queued Total: 00:12:19 Eligible: 00:12:19)
>
> StartTime: Tue Nov 30 18:16:30
> Total Tasks: 10
>
> Req[0] TaskCount: 10 Partition: DEFAULT
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [supernode]
> Allocated Nodes:
> [node105:5][node104:5]
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 739
> PartitionMask: [ALL]
> Reservation '90843' (00:00:00 -> 00:00:30 Duration: 00:00:30)
> PE: 10.00 StartPriority: 12

But when I then check each node with checknode, the node appears idle:

> $ checknode -v node104
> checking node node104
>
> State: Idle (in current state for 00:00:00)
> Expected State: Running SyncDeadline: Tue Nov 30 18:27:34
> Configured Resources: PROCS: 16 MEM: 23G SWAP: 23G DISK: 1M
> Utilized Resources: PROCS: 5
> Dedicated Resources: PROCS: 5
> Opsys: DEFAULT Arch: [NONE]
> Speed: 1.00 Load: 5.000
> Location: Partition: DEFAULT Frame/Slot: 1/1
> Network: [DEFAULT]
> Features: [supernode]
> Attributes: [Batch]
> Classes: [supernodes 11:16]
>
> Total Time: INFINITY Up: INFINITY (73.79%) Active: 00:01:42 (0.00%)
>
> Reservations:
> Job '90843'(x5) 00:00:00 -> 00:00:30 (00:00:30)
> JobList: 90843
>

Google does not turn up much about the message "ALERT: jobs active on node but state is Idle". Does anyone have a clue?

Thank you very much.

Regards,
Nicolas
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
