<x-flowed>
John-
While PBS can wait for the desired number of nodes to become available, it launches only a single task on a single node. That process is then responsible for launching the other tasks and making use of the allocated nodes, which PBS has marked busy. In the HPC world this is usually done with mpirun (which uses ssh). The single node that PBS selects to launch on is designated "mother superior" (the execution host daemons are called "moms"). PBS creates an environment on mother superior that contains a list of the nodes PBS considers "allocated", and that machine list is available to distributed launchers like mpirun.
Mother superior seems ideal for the GA master.
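As a rough illustration of how a master process on mother superior might read that machine list: PBS normally exposes it as a file whose path is in the PBS_NODEFILE environment variable, one hostname per line, repeated once per allocated processor. The function name read_allocated_nodes below is made up for this sketch; it is not part of PBS.

```python
import os
from collections import Counter


def read_allocated_nodes(path=None):
    """Return {hostname: slot_count} for the nodes PBS allocated.

    Assumption: the node file lists one hostname per line and repeats
    a hostname once per allocated processor on that host.
    """
    # Fall back to the path PBS exports into the job's environment.
    path = path or os.environ.get("PBS_NODEFILE")
    if not path:
        return {}  # not running under PBS, or no node file given
    with open(path) as fh:
        hosts = [line.strip() for line in fh if line.strip()]
    return dict(Counter(hosts))


if __name__ == "__main__":
    for host, slots in sorted(read_allocated_nodes().items()):
        print(host, slots)
```

The GA master could call this once at startup to learn which hosts it is allowed to hand work to.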
As for dynamically resizing the GA job, PBS has nothing to do with that. The GA job itself will have to handle it, and it will have to work within the list of nodes that PBS allocated to it. Holding that allocation also prevents PBS from launching other jobs on those nodes. Rebooting those nodes (and killing the PBS execution host daemons) complicates things further.
Myself, I might do it like this:
Launch the GA job via PBS and allocate only 1 node (the master). Make the script running on the GA master capable of detecting when PBS jobs are running on the other nodes and when those nodes are rebooted, so it can automatically throttle back and relinquish them. Also have it detect when they become available again, so it can backfill. I think this is the way Condor behaves.
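A minimal sketch of that throttle-back and backfill decision, under some assumptions: the master can obtain node status text shaped like the output of "pbsnodes -a" (a hostname at the start of a line, followed by indented "key = value" lines such as state and jobs), and a node is fair game for GA only when its state is exactly "free" and no jobs are listed. The function names are invented for this example, and the exact status format should be checked against your PBS version.

```python
def parse_node_states(text):
    """Parse pbsnodes-style text into {host: {key: value}}.

    Assumed layout: an unindented hostname line, then indented
    'key = value' lines, with blank lines between node records.
    """
    nodes, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            current = None                 # blank line ends a record
        elif not line[0].isspace():
            current = line.strip()         # unindented line: a hostname
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def usable_by_ga(attrs):
    """True if the node is up and PBS lists no jobs on it."""
    return attrs.get("state") == "free" and not attrs.get("jobs")


def rebalance(active, node_states):
    """Decide which slave nodes to drop and which to pick up.

    active      -- set of hosts the GA master is currently using
    node_states -- result of parse_node_states()
    Returns (new_active, released, added).
    """
    usable = {h for h, a in node_states.items() if usable_by_ga(a)}
    released = sorted(active - usable)     # throttle back off these
    added = sorted(usable - active)        # backfill onto these
    return usable, released, added
```

The master would run this periodically, feeding it fresh status text (for example from subprocess.run(["pbsnodes", "-a"], capture_output=True, text=True).stdout), stop handing work to the hosts in released, and start slaves on the hosts in added. In practice it would also need to exclude its own host from the slave set.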

Hope that helps-

Jeremy

At 12:37 PM 9/9/2002 -0700, John Proakis wrote:
I have no experience with PBS or any other work
management system.

We need PBS to work on our 32-node OSCAR cluster to
implement some kinds of work management.
Let me describe it briefly.

1. We will always run a very large master-slave
program on this cluster. The master will assign
independent jobs to available slave nodes, and each
slave node will process its job and return the result
to the master node. This is the main use of
this cluster. Let's call this large job GA.

2. Sometimes, we need to use this cluster to do other
work.
(1) We want to power off 8 nodes first, and then use
those 8 nodes to do other jobs. During that time
we don't want the GA program to use those 8 nodes, so we
inform the master node that those 8 nodes are not available.
After we finish, we inform the master node that
those 8 nodes have come back.
(2) Sometimes, we need to reserve 16 or more nodes to
run other programs to get the speedup curves. We
need them for several minutes; during this time, those 16
nodes are not available for the GA program.


I have obtained an Administrator Guide from the PBS
website, but after reading it I cannot figure out how to
do such a thing.


Could you give me some suggestions?
Thanks very much!








-------------------------------------------------------
This sf.net email is sponsored by: OSDN - Tired of that same old
cell phone?  Get a new here for FREE!
https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users



</x-flowed>
