Paul -- Many thanks for your detailed report. I apparently missed a whole boatload of e-mails on 2 May due to a problem with my mail client. Deep apologies for missing this mail! :-(
More information below.

> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
> On Behalf Of Paul Donohue
> Sent: Friday, May 05, 2006 10:47 PM
> To: de...@open-mpi.org
> Subject: [OMPI devel] Oversubscription/Scheduling Bug
>
> I would like to be able to start a non-oversubscribed run of a
> program in OpenMPI as if it were oversubscribed, so that the
> processes run in Degraded Mode, such that I have the option to start
> an additional simultaneous run on the same nodes if necessary.
> (Basically, I have a program that will ask for some data, run for a
> while, then print some results, then stop and ask for more data. It
> takes some time to collect and input the additional data, so I would
> like to be able to start another instance of the program which can
> be running while I'm inputting data to the first instance, and can
> be inputting while the first instance is running.)
>
> Since I have single-processor nodes, the obvious solution would be
> to set slots=0 for each of my nodes, so that using 1 slot for every
> run causes the nodes to be oversubscribed. However, it seems that
> slots=0 is treated like slots=infinity, so my processes run in
> Aggressive Mode, and I lose the ability to oversubscribe my node
> using two independent processes.

I'd prefer to keep slots=0 synonymous with "infinity", if only for
historical reasons (it's also less code to change :-) ).

> So, I tried setting '--mca mpi_yield_when_idle 1', since this
> sounded like it was meant to force Degraded Mode. But, it didn't
> seem to do anything - my processes still ran in Aggressive Mode. I
> skimmed through the source code real quick, and it doesn't look like
> mpi_yield_when_idle is ever actually used.

Are you sure? How did you test this? I just did a few tests and it
seems to work fine for me.

The MCA param "mpi_yield_when_idle" is actually used within the OPAL
layer, in opal/runtime/opal_progress.c. (The name is somewhat of an
abstraction break -- it reflects the fact that the progression engine
used to be up in the MPI layer; it was moved into OPAL when the entire
source code tree was split into OPAL, ORTE, and OMPI.)

You can check whether this param is set by using the
mpi_show_mca_params MCA parameter. Setting it to 1 makes all MPI
processes display the current values of their MCA parameters on
stderr. For example:

-----
shell% mpirun -np 1 --mca mpi_show_mca_params 1 hello |& grep yield
[foo.example.com:23206] mpi_yield_when_idle=0
shell% mpirun -np 1 --mca mpi_yield_when_idle 1 --mca mpi_show_mca_params 1 hello |& grep yield
[foo.example.com:23213] mpi_yield_when_idle=1
-----

It may be difficult to tell whether this behavior is working properly
because, by definition, if you're in an oversubscribed situation
(assuming that all of your processes are trying to fully utilize the
CPU), the entire system could be running pretty slowly anyway. The
difference between aggressive and degraded mode is that we call
yield() in the middle of OMPI's tight progression loops. Hence, if
you're oversubscribed, this actually gives other processes a chance of
being scheduled / run by the OS. For example, if you oversubscribe and
don't have this param set, then because OMPI uses tight repetitive
loops for progression, you will typically see one process completely
hogging the CPU for a long, long time before the OS finally lets
another process be scheduled.

I just did a small test: running 3 processes on a 2-way SMP. Each MPI
process sends a short message around in a ring pattern 100 times:

- mpi_yield_when_idle=1 : 1.4 seconds running time
- mpi_yield_when_idle=0 : 22.8 seconds running time

So it can make a big difference. But don't expect it to completely
mitigate the effects of oversubscription.
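For reference, a minimal sketch of this kind of ring test might look
like the following. (This is an illustrative reconstruction, not the
actual program used for the timings above; the short integer payload
and the hard-coded 100 laps are assumptions.)

-----
/* ring.c -- illustrative sketch only, NOT the actual test program.
 * Each process passes a short message around a ring of all ranks,
 * 100 times. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, next, prev, i, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    next = (rank + 1) % size;          /* neighbor we send to */
    prev = (rank + size - 1) % size;   /* neighbor we receive from */

    for (i = 0; i < 100; ++i) {
        if (rank == 0) {
            /* rank 0 starts each lap, then waits for the token to
             * come all the way back around the ring */
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
-----

Built with mpicc and launched oversubscribed (e.g., "mpirun -np 3
--mca mpi_yield_when_idle 1 ring" versus the same command without the
MCA param), something like this exercises the kind of tight
send/receive progression loops where the yield() call matters.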
> I also noticed another bug in the scheduler:
> hostfile:
> A slots=2 max-slots=2
> B slots=2 max-slots=2
> 'mpirun -np 5' quits with an over-subscription error
> 'mpirun -np 3 --host B' hangs and just chews up CPU cycles forever

Yoinks; this is definitely a bug. I've filed a bug in our tracker to
get this fixed. Thanks for reporting it.

> And finally, on http://www.open-mpi.org/faq/?category=tuning
> - 11. How do I tell Open MPI to use processor and/or memory affinity?
> It mentions that OpenMPI will automatically disable processor
> affinity on oversubscribed nodes. When I first read it, I

Correct.

> made the assumption that processor affinity and Degraded Mode were
> incompatible. However, it seems that independent non-oversubscribed
> processes running in Degraded Mode work fine with processor affinity
> - it's only actually oversubscribed processes which have problems. A
> note that Degraded Mode and Processor Affinity work together even
> though Processor Affinity and oversubscription do not would be nice.

Good point. I'll update the FAQ later today; thanks!

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems