John-
While PBS can wait for a desired number of nodes to become available, it launches only a single task on a single node. That process is then responsible for launching the other tasks and making use of the allocated nodes, which PBS has marked busy. In the HPC world this is usually done with mpirun (uses ssh).
The single node that PBS selects to launch on is designated "mother superior" (the execution host daemons are called "moms"). PBS creates an environment on mother superior which contains a list of the nodes that PBS considers "allocated", and that machine list is available to distributed launchers like mpirun. This seems ideal for the GA master.
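To make the machine list concrete, here is a rough sketch of how a master script could read it. I'm assuming the list is exposed the usual way, as a file whose path is in the PBS_NODEFILE environment variable, with one hostname per line (a host is repeated once per allocated processor). The function name allocated_nodes is just something I made up for illustration.

```python
import os


def allocated_nodes(nodefile_path=None):
    """Return the unique hosts PBS allocated to this job, in first-seen order.

    Assumes the node list is a plain text file, one hostname per line,
    whose path is given by the PBS_NODEFILE environment variable.
    """
    path = nodefile_path or os.environ["PBS_NODEFILE"]
    hosts = []
    with open(path) as fh:
        for line in fh:
            host = line.strip()
            # Skip blank lines and hosts already listed for another processor.
            if host and host not in hosts:
                hosts.append(host)
    return hosts
```

The master would call allocated_nodes() once at startup and treat the result as the upper bound on the workers it may use.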
Now for dynamically resizing the GA job, that's not something PBS has anything to do with. The GA job itself will have to handle that, and you'll have to have it work within the list of nodes that PBS allocated to it. This will prevent PBS from being able to launch other jobs on these nodes too. Rebooting those nodes (and killing the PBS execution host daemons) complicates things further.
Myself, I might do it like this:
Launch the GA job via PBS and only allocate 1 node (the master). Make the script running on the GA master capable of detecting when PBS jobs are running on the nodes, and when the nodes are rebooted, so it can automatically throttle back and relinquish them. Also, have it detect when they're available again, so it can backfill. I think this is the way Condor behaves.
Hope that helps-
Jeremy
At 12:37 PM 9/9/2002 -0700, John Proakis wrote:
I have no experience with PBS or any other work management system. We need PBS on our 32-node OSCAR cluster to implement some kinds of work management. Let me describe it briefly.

1. We will always run a very large master-slave program on this cluster. The master assigns independent jobs to available slave nodes, and each slave node processes its job and returns the result to the master node. This is the main use of this cluster. Let's call this large job GA.

2. Sometimes we need to use this cluster for other work.
(1) We want to power off 8 nodes first, and then use those 8 nodes for other jobs. During that time we don't want the GA program to use those 8 nodes, so we need to inform the master node that they are not available. After we finish, we inform the master node that those 8 nodes are back.
(2) Sometimes we need to reserve 16 or more nodes to run other programs to get speedup curves. We need them for several minutes; during this time, those 16 nodes are not available to the GA program.

I got an Administrator Guide from the PBS website, but after reading it I cannot figure out how to do such a thing. Could you give me some suggestions?

Thanks very much!

_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
