> -----Original Message-----
> From: devel-boun...@open-mpi.org
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Paul Donohue
> Sent: Wednesday, May 24, 2006 10:27 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Oversubscription/Scheduling Bug
>
> I'm using Open MPI 1.0.2 (in case it makes a difference).
>
> $ mpirun -np 2 --hostfile test --host psd.umd.edu --mca
> mpi_yield_when_idle 1 --mca orte_debug 1 hostname 2>&1 | grep yield
> [psd:30325] pls:rsh: /usr/bin/ssh <template> orted
> --debug --bootproxy 1 --name <template> --num_procs 2
> --vpid_start 0 --nodename <template> --universe
> paul@psd:default-universe-30325 --nsreplica
> "0.0.0;tcp://128.8.96.50:35281" --gprreplica
> "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
> [psd:30325] pls:rsh: not oversubscribed -- setting
> mpi_yield_when_idle to 0
> [psd:30325] pls:rsh: executing: orted --debug --bootproxy 1
> --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename
> psd.umd.edu --universe paul@psd:default-universe-30325
> --nsreplica "0.0.0;tcp://128.8.96.50:35281" --gprreplica
> "0.0.0;tcp://128.8.96.50:35281" --mpi-call-yield 0
> $
>
> When it runs the worker processes, it passes --mpi-call-yield 0
> to the workers even though I set mpi_yield_when_idle to 1.
This actually winds up in a comedy of errors. The end result is that
mpi_yield_when_idle *is* set to 1 in the MPI processes.

1. Strictly speaking, you're right that the rsh pls should probably
   not be setting that variable when *not* oversubscribing. More
   specifically, we should only set it to 1 when we are
   oversubscribing. But by point 3 (below), this is actually
   harmless.

2. The orted gets the option "--mpi-call-yield 0", but it actually
   does the Right Thing: it only sets the MCA parameter to 1 if the
   argument to --mpi-call-yield is > 0. Hence, in this case, it does
   not set the MCA parameter at all.

3. mpirun and the orted bundle up MCA parameters from the mpirun
   command line and environment and seed them in the newly-spawned
   processes. As such, mpirun command-line and environment MCA
   parameters override anything that the orted may have set (e.g.,
   via --mpi-call-yield). This is by design.

You can see this by slightly modifying your test command -- run "env"
instead of "hostname". You'll see that the environment variable
OMPI_MCA_mpi_yield_when_idle is set to the value that you passed on
the mpirun command line, regardless of a) whether you're
oversubscribing or not, and b) whatever is passed in through the
orted.

I'm trying to think of a case where this will not be true, and I
think it's only platforms where we don't use the orted (e.g., Red
Storm, where oversubscription is not an issue).

> I tried testing 4 processes on a 2-way SMP as well.
> One pair of processes is waiting on STDIN.
> The other pair of processes is running calculations.
> First, I ran only the calculations without the STDIN
> processes - 35.5 second run time
> Then I ran both pairs of processes, using slots=2 in my
> hostfile, and mpi_yield_when_idle=1 for both pairs - 25
> minute run time
> Then I ran both pairs of processes, using slots=1 in my
> hostfile - 48 second run time

This is quite fishy.
Note that the processes blocking on STDIN should not be affected by
the MPI yield setting -- the MPI yield setting is *only* in effect
when you're waiting for progress in an MPI function (e.g., in
MPI_SEND or MPI_RECV or the like). So:

- on a 2-way SMP
- if you have N processes running
- 2 of which are blocking in MPI calls
- (N-2) of which are blocking on <STDIN>

Note that Open MPI's "blocking" calls usually spin trying to make
progress. So in the above scenario, you'll have 2 MPI processes
spinning heavily and probably fully utilizing both CPUs. The other
(N-2) processes should not be a factor.

So the question is: why does setting mpi_yield_when_idle to 1 take so
much time? I'm guessing that it's doing exactly what it's supposed to
be doing -- lots and lots of yielding (although I agree that a jump
from 48 seconds to 25 minutes seems a bit excessive). The constant
yielding could be quite expensive. Are your 2 processes doing a lot
of very large communications with each other?

> > Good point. I'll update the FAQ later today; thanks!
>
> Sweet! It would probably be worth mentioning
> mpi_yield_when_idle=1 in there too - it took some digging for
> me to find that option
> (After it's fixed, of course ;-) )

Will do.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems