Re: [OMPI devel] MPI_Comm_spawn[_multiple] and orted

Pak Lui Wed, 31 May 2006 19:34:37 -0400

Ralph Castain wrote:

First, the fact that an orted already exists on a node is not sufficient to allow us to use it again for another application. The orted must be persistent or else we do not allow a new application to re-use it. This is required because the existing orted will go away when its original application is done executing - if we use it as our parent to launch another child, then the new application process will "die" when the original one completes. Obviously, that isn't desirable.
okay. I used to think that orted was able to stay around and fork other processes; I didn't realize orted will go away once the parent process finishes.
I don't know how to get around this problem for non-persistent orteds. Perhaps we can devise some mechanism. The problem is that mpirun needs to exit when it finishes the associated application. Without a persistent orted, mpirun serves as the parent process for everything that is executing, including the daemons. So, for mpirun to exit, that means all of its children must also terminate.

wow, that's a thought. Do you mean that after we start an SGE (interactive/batch) job, we first have the user fire up a persistent orted, so that 'qrsh' launches persistent orteds onto all of the SGE nodes and they keep running for the duration of the SGE job? That way, the subsequent mpirun will not need to use qrsh again to launch on the remote nodes. I think that may actually solve the problem.

If we try to link one mpirun to another, then we have the problem that we must force the first mpirun to "stay alive" until the second one completes. This could be done, but seems problematic and contrary to normal user expectations.


agree, that is not good.

Second, even though you can launch persistent orteds today, none of the current components in the resource management subsystems actually know how to use them yet. This is something we planned to implement in the future, but there simply hasn't been time to do so yet.
So the bottom line is that there really is no way around the need to launch a new orted on each node every time the user issues an mpirun command.
I hope that answers your question. If not, please don't hesitate to let me know.
Thanks for pointing out these issues. I was hoping there was something I didn't know about that might solve my problem. I guess there may not be a good workaround for this limitation due to SGE slots. We could try to track the number of times that qrsh has been exec'ed and set an upper limit on it, so that the spawn program stops before it uses up all the available SGE slots and errors out.
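A minimal sketch of the "track and cap" idea, written in Python for brevity. The class name QrshSlotBudget and its methods are hypothetical; nothing like this exists in Open MPI or SGE, and the total slot count is assumed to be supplied by the caller (for instance, from whatever the job allocation reports).

```python
class QrshSlotBudget:
    """Hypothetical counter of qrsh launches against an SGE slot allocation."""

    def __init__(self, total_slots):
        if total_slots < 0:
            raise ValueError("total_slots must be non-negative")
        self.total_slots = total_slots
        self.used = 0

    def try_reserve(self, n=1):
        """Reserve n slots for a qrsh launch; refuse instead of overrunning."""
        if self.used + n > self.total_slots:
            return False
        self.used += n
        return True

    def release(self, n=1):
        """Give back n slots once the corresponding qrsh task has ended."""
        self.used = max(0, self.used - n)
```

A spawn would call try_reserve() before each qrsh and report a clear error when it returns False, rather than letting qrsh itself fail once the slots are gone.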
Hmmm...it sounds to me like the problem here is that the second OpenRTE universe (the one created by the second mpirun) has no knowledge of what the other universe may already have done.

it sounds like a good assumption, but it actually turns out that the orteds for both the spawner (parent) and spawnee (child) belong to the same universe. So it may not be the case you describe.


parent:

15923 ?? S 0:00 orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename burl-ct-v440-5 --universe paklui@burl-ct-v440-5:default-universe --nsreplica "0.0.0;tcp://10.8.30.128:47797" --gprreplica "0.0.0;tcp://10.8.30.128:47797" --mpi-call-yield 0


child on the same node as parent:

15935 ?? S 0:00 orted --no-daemonize --bootproxy 2 --name 0.0.3 --num_procs 4 --vpid_start 0 --nodename burl-ct-v440-5 --universe paklui@burl-ct-v440-5:default-universe --nsreplica "0.0.0;tcp://10.8.30.128:47797" --gprreplica "0.0.0;tcp://10.8.30.128:47797" --mpi-call-yield 0


child on a different node:

5563 ?? S 0:00 orted --no-daemonize --bootproxy 2 --name 0.0.2 --num_procs 4 --vpid_start 0 --nodename burl-ct-v440-4 --universe paklui@burl-ct-v440-5:default-universe --nsreplica "0.0.0;tcp://10.8.30.128:47797" --gprreplica "0.0.0;tcp://10.8.30.128:47797" --mpi-call-yield 0
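The same-universe observation can also be checked mechanically. The sketch below is only an illustration: orted_options is a hypothetical helper, not part of Open MPI, and the two sample command lines are shortened copies of the parent and child entries in the ps listings above.

```python
import shlex

def orted_options(cmdline):
    """Hypothetical helper: map '--flag value' pairs of a command line to a dict.

    Flags that are not followed by a value map to None.
    """
    tokens = shlex.split(cmdline)
    opts = {}
    i = 1  # skip the program name
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            opts[tok[2:]] = tokens[i + 1] if has_value else None
            i += 2 if has_value else 1
        else:
            i += 1
    return opts

# Shortened copies of the parent and same-node child command lines.
parent = ('orted --no-daemonize --bootproxy 1 --name 0.0.1 '
          '--universe paklui@burl-ct-v440-5:default-universe '
          '--nsreplica "0.0.0;tcp://10.8.30.128:47797"')
child = ('orted --no-daemonize --bootproxy 2 --name 0.0.3 '
         '--universe paklui@burl-ct-v440-5:default-universe '
         '--nsreplica "0.0.0;tcp://10.8.30.128:47797"')

same_universe = orted_options(parent)["universe"] == orted_options(child)["universe"]
```

Both daemons report the same --universe and the same --nsreplica contact string, while --name and --bootproxy differ.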

I assume that by "the second mpirun" you mean that MPI_Comm_spawn is internally treated as an mpirun (or orterun). Just to clarify, I use only one mpirun to run the spawner executable.


So, the RAS in the second universe reads the qrsh environmental variable to see how many slots are available - and doesn't know that the other mpirun already used some.

I can definitely agree with that. In fact, there's actually no easy way (no env vars) to find out how many slots (for qrsh tasks) have been used up already. The SGE engineer that I worked with also agrees that this would be useful, not just for us, but for other MPI implementations out there as well. So it's something I might need to work out with them.

I assume, therefore, that the other mpirun is basically being run in the background - we don't complete it first before letting the next one begin executing?

there is only 1 mpirun, but the orteds for both the parent and the child are actually running in parallel at the same time.

The only solution I can think of to that problem would be to kick off a persistent daemon to act as the "seed" for the entire time we are in the shell (either interactive or batch). This will ensure that the knowledge of resource usage carries over from one execution to the next. We actually do this with the Eclipse folks, so I know the mechanism works. We also actually kick off a daemon that does the launch in many of the different systems - only difference here is that we normally don't make it persistent (it is just a child of mpirun). Problem here is to figure out how to handle the persistent part of this transparently to the user.
What we could do is bury this in an appropriate component somewhere (I have an idea where it might go, but need to think a little more to be sure). If this is the first mpirun within a given shell, then we kick off a persistent orted to act as the seed and connect ourselves to it (we have functions in the system to do this already). If we are not the first mpirun, then we just connect to the existing "seed". The "seed" does all the actual launching of applications.
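One generic way to decide "am I the first mpirun in this shell?" is an atomic create-if-absent on a marker file. The sketch below is an assumption for illustration only: claim_seed_role and the marker path are hypothetical, and this is not how the existing OpenRTE connect functions work.

```python
import os

def claim_seed_role(marker_path):
    """Return True if the caller should start the seed, False if one is already claimed.

    O_CREAT | O_EXCL makes creation atomic, so only one of several
    concurrent callers can succeed in creating the marker.
    """
    try:
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as marker:
        marker.write(str(os.getpid()))  # record who claimed the role
    return True
```

The first caller gets True and would start the persistent orted; later callers get False and would simply connect to the existing seed. A real implementation would also have to handle a stale marker left behind by a seed that crashed.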
This lets each mpirun exit as usual - only the seed keeps alive. We would need to establish a way to kill the seed when all the mpiruns are complete - sort of a "last-one-out kills the orted" procedure. That would take a little care as we don't want a race condition to creep into the system - if another mpirun is coming, but the prior one exits quickly, we don't want the seed to die just yet.
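A minimal sketch of the "last-one-out" bookkeeping, assuming a reference count plus a grace period as the guard against the race described above. SeedLifetime and its method names are hypothetical and are not an existing OpenRTE interface; time is passed in explicitly so the logic is easy to follow.

```python
import threading

class SeedLifetime:
    """Hypothetical bookkeeping for a persistent "seed" daemon."""

    def __init__(self, grace_seconds):
        self._lock = threading.Lock()
        self._clients = 0          # number of mpiruns currently attached
        self._idle_since = None    # when the last client left; None while busy or unused
        self._grace = grace_seconds

    def attach(self):
        """Called when an mpirun connects; cancels any pending shutdown."""
        with self._lock:
            self._clients += 1
            self._idle_since = None

    def detach(self, now):
        """Called when an mpirun exits; starts the grace period if it was the last."""
        with self._lock:
            if self._clients == 0:
                raise RuntimeError("detach without a matching attach")
            self._clients -= 1
            if self._clients == 0:
                self._idle_since = now

    def should_exit(self, now):
        """True only if no client is attached and the grace period has elapsed."""
        with self._lock:
            return (self._clients == 0
                    and self._idle_since is not None
                    and now - self._idle_since >= self._grace)
```

If a new mpirun attaches during the grace period, the idle timestamp is cleared and the seed stays alive. The count and the timestamp are only changed under one lock, so the exit check cannot observe a half-updated state.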

this is good information, but I haven't played around with persistent daemons at all. It should all make sense after a day or two. I'll get back to you on this later.

Hope all that helps. We may be able to resolve this in a fairly straightforward manner - I think a lot of the necessary tools are already in the system, we just need to "hook them up" appropriately for SGE.


yup, that's the goal.

Ralph



Pak Lui wrote:
Hi,

When I run a spawn program over rsh/ssh, I notice that each time the
child program gets spawned, it needs to establish a new rsh/ssh
connection to the remote node to launch orted on that node, even if the
parent executable and its orted are already running on that node.

So I wonder if there is any way that we can use the parent orted to
launch the child program if they happen to be on the same node?
I tried comparing the spawn program to the scenario where I run multiple executables in one mpirun command. For this run, I establish only one connection to the remote node, and both executables share the same remote connection.
% ./mpirun -np 2 -host burl-ct-v440-5 -prefix `pwd`/.. sleep 12 : -np 2 sleep 10
Password:

15015 /workspace/paklui/ompi/trunk/builds/sparc32-g/bin/../bin/orted --bootprox
   15017 sleep 12
   15019 sleep 12
   15021 sleep 10
   15023 sleep 10
The reason I want to find out whether orted can launch child executable(s) without establishing a new connection is that the number of times I can run 'qrsh' in SGE (or N1GE) actually depends on the number of slots that the user initially allocated. The slot count corresponds to the number of CPUs on a node, and each slot allows one 'qrsh' connection.
The issue arises when I try to run a spawn job on a single node, or on a cluster of many 1-CPU nodes, under SGE. The number of times the program can spawn is limited by 'qrsh', which forbids the child program from connecting to a node where the parent executable's orted may already be running.
I am curious to see if I can find some solution to the problem here. I am also looking to see if there are some tricks in SGE to get around this issue, but the workarounds I can see aren't pretty. So I welcome your questions, comments or suggestions on this.



--

Thanks,

- Pak Lui
pak....@sun.com
