On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:
> It sounds reasonable to me. I agree with Ralf W about having mpirun
> send a STOP to itself - that would seem to solve the problem about
> stopping everything.
>
> It would seem, however, that you cannot similarly STOP the daemons
> or else you won't be able to CONT the job. I'm not sure how big a
> deal that is - I can't think of any issue it creates offhand.
>
> Is there any issue in the MPI comm layers if you abruptly STOP a
> process while it's communicating? Especially since the STOP is going
> to be asynchronous. Do you need to quiet networks like IB first?

It might be better to allow the MPI procs to do "something" before
actually stopping. This might prevent timeout-sensitive stuff from
failing (although I don't know if Josh's CR code even handles these
kinds of things...?). The obvious case that I can think of is if the
MPI process is stopped in the middle of an openib CM action. None of
the openib CPCs can currently handle a timeout nicely.
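
To make the "do something before actually stopping" idea concrete, here is a rough sketch in Python using plain POSIX signals. It is not Open MPI code. It assumes the stop request arrives as a catchable SIGTSTP, and the names run_demo and on_tstp are made up for illustration. The handler runs a placeholder quiesce step and only then sends itself the uncatchable SIGSTOP; the parent plays the role of whatever later issues the CONT.

```python
import os
import signal
import time


def run_demo():
    """Fork a child that quiesces before stopping; return what the parent saw."""
    read_fd, write_fd = os.pipe()
    state = {"resumed": False}

    def on_tstp(signum, frame):
        # Hypothetical "do something first" hook. A real MPI process would
        # finish or park in-flight network work here; this sketch only
        # reports that the hook ran.
        os.write(write_fd, b"q")
        # SIGSTOP cannot be caught, so this is where the process suspends.
        os.kill(os.getpid(), signal.SIGSTOP)
        # Execution continues here only after someone sends SIGCONT.
        state["resumed"] = True

    # Install the handler before fork so the child never sees SIGTSTP unhandled.
    previous = signal.signal(signal.SIGTSTP, on_tstp)
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        while not state["resumed"]:   # stand-in for the application's work loop
            time.sleep(0.01)
        os._exit(0)

    signal.signal(signal.SIGTSTP, previous)    # parent restores its old handler
    os.close(write_fd)
    os.kill(pid, signal.SIGTSTP)               # catchable stop request
    _, status = os.waitpid(pid, os.WUNTRACED)  # returns once the child is stopped
    stopped = os.WIFSTOPPED(status)
    hook_ran = os.read(read_fd, 1) == b"q"     # hook wrote before the stop
    os.kill(pid, signal.SIGCONT)               # resume the child
    _, status = os.waitpid(pid, 0)             # wait for it to exit
    os.close(read_fd)
    exited_cleanly = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    return stopped, hook_ran, exited_cleanly


if __name__ == "__main__":
    print(run_demo())
```

The point of the split is that the stop request stays asynchronous while the actual suspension happens at a moment the process chooses. Whether a useful quiesce step exists for something like an openib CM exchange is exactly the open question above; the sketch does not address that.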
--
Jeff Squyres
Cisco Systems