On Thu, 2006-04-06 at 16:18 +0100, Baker D.J. wrote:
> Our cluster has recently got far more complex, and we have put some
> interim scheduling policies in place until we can work out something
> better. I wonder if there are cluster administrators in the community
> who could advise us, please. I suspect others have similar setups in
> place.
>
> Basically we have one large torque/maui controlled cluster consisting
> of...
>
> 1. Single-core nodes -- all departments
> 2. Dual-core nodes -- all departments
> 3. Departmental nodes -- 4 nodes for chemistry (dual core), 8 nodes
>    for eScience (dual core), 5 single-core nodes for Sound/Vibration.
>
> The nodes in (1) are older, and all users can use them by default;
> access is trivial. For the new nodes (2 and 3) we have devised a
> simple scheme to control access based on switch boundaries. For
> example, for the nodes in (2), we have...
>
> NODECFG[purple301] FEATURES=switch10
> ...
> NODECFG[purple332] FEATURES=switch11
> ...
>
> Switches 10 and 11 aren't defined in the maui NODESETLIST, so users
> must specify the appropriate switch(es) on their qsub command. Above
> all we want to ensure that jobs don't ever grab a mix of nodes from
> (1) and (2). Clunky, but it works.
>
> For the nodes in (3) we have followed the same "switch" idea, but
> have also defined a standing reservation to limit user access. Also,
> of course, users can ensure that their jobs can spill over into the
> main facility by doing something like:
>
> qsub -W x=NODESET:ONEOF:FEATURE:switch10:switch11:escience ...
>
> Above all, I think this scheme is clunky and could be improved
> upon(?) -- we are writing a script to hide the details, however.
> Could anyone with more experience of setting up large systems please
> advise us by suggesting possible setups based on queues, partitions,
> etc. An interesting question comes to mind... in a torque/maui
> system, is it possible for queued jobs to migrate from one queue to
> another if resources are busy
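For reference, the scheme in the quoted message relies on Maui's
node-set parameters; a maui.cfg fragment along roughly the following
lines would give the behaviour described. The policy values and the
list of "default" switches below are guesses for illustration, not the
actual configuration.

-----
# Node sets are keyed on node features.  Only the features named in
# NODESETLIST are candidates for automatic node-set selection, so
# switch10/switch11 have to be requested explicitly with
# "qsub -W x=NODESET:ONEOF:FEATURE:..." as described above.
NODESETPOLICY     ONEOF
NODESETATTRIBUTE  FEATURE
NODESETLIST       switch1 switch2 switch3

NODECFG[purple301] FEATURES=switch10
NODECFG[purple332] FEATURES=switch11
-----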
There are multiple ways to do this.

In the past, I've used a combination of partitions and standing
reservations to segregate jobs onto different types of nodes based on
their class/queue. For instance, on our Pentium 4 cluster we have 112
nodes with InfiniBand hardware (partition "parallel" in Maui/Moab) and
144 nodes without (partition "serial"). There are corresponding queues
in PBS/TORQUE that are routed to by a default queue called "batch";
users are encouraged not to specify a queue, but rather to specify the
resources they need. Standing reservations then enforce access control
on the nodes in the partitions for particular queues/classes of jobs,
and the partitions keep jobs from spanning across multiple classes of
nodes.

In your case, it sounds like your fundamental access control is at the
department level. If everyone in a department is in the same UNIX
group, you can key off of that; however, a more likely scenario is
that UNIX groups are set on research-group boundaries, in which case
you'll want to construct accounts (a sort of meta-group in Maui) for
each department:

-----
# partitions -- dualcore, chem, escience & sndvib
# Single-core nodes won't have a partition ID set and will end up in
# the DEFAULT partition.
# Note that you will need to increase MMAX_MPAR to 6 in maui.h to
# allow 4 user-defined partitions (plus ALL and DEFAULT).

NODECFG[dualcore1]   PARTITION=dualcore
...
NODECFG[chem1]       PARTITION=chem
...
NODECFG[escience1]   PARTITION=escience
...
NODECFG[sndvib1]     PARTITION=sndvib
...

ACCOUNTCFG[DEFAULT]  PLIST=DEFAULT:dualcore
ACCOUNTCFG[chem]     PLIST=DEFAULT:dualcore:chem
ACCOUNTCFG[escience] PLIST=DEFAULT:dualcore:escience
ACCOUNTCFG[sndvib]   PLIST=DEFAULT:dualcore:sndvib

GROUPCFG[grp1] ADEF=DEFAULT
GROUPCFG[grp2] ADEF=chem
GROUPCFG[grp3] ADEF=escience
GROUPCFG[grp4] ADEF=sndvib
GROUPCFG[grp5] ADEF=escience
...
-----

I'm a little surprised that the Maui docs [1] don't list ADEF as a
valid parameter to GROUPCFG, as I'm pretty sure I used it in Maui
before we converted over to using Moab.

[1] http://www.clusterresources.com/products/maui/docs/a.fparameters.shtml

Hope this helps,
	--Troy
--
Troy Baer                      [EMAIL PROTECTED]
Science & Technology Support   http://www.osc.edu/hpc/
Ohio Supercomputer Center      614-292-9701

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
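The routing queue and the standing reservations described for the
Pentium 4 cluster above aren't shown in the fragment, so here is a
minimal sketch of what they might look like. The queue names "serial"
and "parallel", the nodect limits, and the host names are illustrative
assumptions, not the actual configuration. On the TORQUE side, a
routing queue can be built with qmgr so that a job lands in an
execution queue based on the node count it requests:

-----
# Shell sketch: two execution queues plus a default routing queue
# "batch" that picks a destination from the jobs' resource requests.
qmgr -c "create queue serial queue_type=execution"
qmgr -c "set queue serial resources_max.nodect = 1"
qmgr -c "set queue serial enabled = true"
qmgr -c "set queue serial started = true"

qmgr -c "create queue parallel queue_type=execution"
qmgr -c "set queue parallel resources_min.nodect = 2"
qmgr -c "set queue parallel enabled = true"
qmgr -c "set queue parallel started = true"

qmgr -c "create queue batch queue_type=route"
qmgr -c "set queue batch route_destinations = serial"
qmgr -c "set queue batch route_destinations += parallel"
qmgr -c "set queue batch enabled = true"
qmgr -c "set queue batch started = true"
qmgr -c "set server default_queue = batch"
-----

On the Maui side, standing reservations keyed on class can then keep
"parallel" jobs on the InfiniBand nodes and "serial" jobs off them;
again, the host names below are made up:

-----
# maui.cfg sketch: one standing reservation per class.
# Host lists abbreviated -- in practice list (or pattern-match) all
# 112 InfiniBand nodes and all 144 others.
SRCFG[parallel] PERIOD=INFINITY
SRCFG[parallel] HOSTLIST=ib001,ib002,ib003
SRCFG[parallel] CLASSLIST=parallel

SRCFG[serial]   PERIOD=INFINITY
SRCFG[serial]   HOSTLIST=s001,s002,s003
SRCFG[serial]   CLASSLIST=serial
-----

With the routing queue as the server default, users only need to ask
for resources (e.g. "qsub -l nodes=4:ppn=2") rather than name a queue,
which matches the advice above to specify resources instead of queues.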
