On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:

It sounds reasonable to me. I agree with Ralf W about having mpirun send a STOP to itself - that would seem to solve the problem about stopping everything.

It would seem, however, that you cannot similarly STOP the daemons or else you won't be able to CONT the job. I'm not sure how big a deal that is - I can't think of any issue it creates offhand.
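
As a rough illustration of the ordering described above (a sketch in Python, not Open MPI's actual C code): the launcher relays the stop to the application processes, leaves the daemons running so that a later CONT can still be relayed, and only then stops itself. The names `relay_stop`, `app_pids`, `send` and `stop_self` are hypothetical, and using `os.kill` directly assumes the processes are local; in a real job the signal would travel through the daemons.

```python
import os
import signal


def relay_stop(app_pids, send=os.kill, stop_self=None):
    """Stop the application processes, then stop the launcher itself.

    Daemon pids are deliberately not part of app_pids: the daemons must
    stay runnable so they can relay a later SIGCONT to the job.
    `send` and `stop_self` are injectable so the ordering can be checked
    without actually stopping anything.
    """
    if stop_self is None:
        def stop_self():
            # Assumption: the launcher suspends itself with SIGSTOP.
            os.kill(os.getpid(), signal.SIGSTOP)

    # First relay the stop to every application process...
    for pid in app_pids:
        send(pid, signal.SIGSTOP)

    # ...and only afterwards stop the launcher, so the relay is complete.
    stop_self()
```

Stopping the launcher last matters: if it stopped itself first, nothing would be left running to relay the signal to the rest of the job.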

Is there any issue in the MPI comm layers if you abruptly STOP a process while it's communicating? Especially since the STOP is going to be asynchronous. Do you need to quiet networks like IB first?

It might be better to allow the MPI procs to do "something" before actually stopping. This might prevent timeout-sensitive operations from failing (although I don't know if Josh's CR code even handles these kinds of things...?). The obvious case that I can think of is if the MPI process is stopped in the middle of an openib CM action. None of the openib CPCs can currently handle a timeout nicely.
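
A minimal sketch of the "do something before stopping" idea, in Python rather than Open MPI's C, and assuming the request reaches the process as the catchable SIGTSTP (SIGSTOP itself cannot be caught). The names `make_tstp_handler`, `quiesce` and `stop_self` are hypothetical; what the quiesce step would actually have to do for openib is exactly the open question here.

```python
import os
import signal


def make_tstp_handler(quiesce, stop_self=None):
    """Build a SIGTSTP handler that quiesces first and stops afterwards.

    `quiesce` is a caller-supplied hook (hypothetical), e.g. finishing or
    deferring an in-flight connection setup. `stop_self` is injectable so
    the ordering can be exercised without suspending the process.
    """
    if stop_self is None:
        def stop_self():
            # SIGSTOP cannot be caught or ignored, so this really suspends us.
            os.kill(os.getpid(), signal.SIGSTOP)

    def handler(signum, frame):
        quiesce()    # let timeout-sensitive work reach a safe point
        stop_self()  # then actually stop; execution resumes here on SIGCONT

    return handler
```

A process would install it with `signal.signal(signal.SIGTSTP, make_tstp_handler(my_quiesce))`, where `my_quiesce` is whatever hook the communication layer provides. Since the stop is asynchronous, a real hook would also have to be safe to run at an arbitrary point in the communication code.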

--
Jeff Squyres
Cisco Systems
