On Dec 17 2010, Jeff Squyres wrote:
> It's not an unknown problem -- as George and Ralph were trying to say, it
> was a design decision on our part.
> Sadly, flexible dynamic processing is not something that many people ask
> for. We have invested time in it over the years to get it working and have
> a baseline functionality level. Beyond that, we unfortunately simply
> haven't had enough requests to justify spending time to do stuff like you
> suggest (e.g., allow abnormal termination of MPI-disconnected processes
> to not also take down previously-connected processes). :-(
And my responses (which were probably confusing) were meant as a hint as to WHY
it is a hard problem. I have a lot of experience at this level for a very
wide range of systems, and it's something that I would hate to have to
implement even for a single system - let alone for the range of systems
that Open MPI supports.
I could tell you some horror stories of processes owned by one user taking
down ones owned by OTHER users, because the controlling terminal had been
reused. And, upon investigation, it wasn't even possible to identify a
bug in any of the programs or operating system - it was merely a "gotcha"
that had sneaked through the cracks in the specifications and bitten me
in a painful place.
The following is what I teach about it in my course (in full):
    You can add groups of processes dynamically
    MPI-2 is probably the best way to do this
    * My recommendation is don't even THINK of it
    This was a nightmare area in PVM
    The potential system problems are unbelievable
    And that is even if you are your own administrator
    If you aren't, you may get strangled for using this
Regards,
Nick Maclaren.