I forgot to mention that I completely agree that we don't need (or want) to pause/resume the orteds. This is also in total agreement with the checkpoint/restart philosophy: we are only checkpointing and restarting the user application(s), not the run-time infrastructure. There may still be quiescing issues within ORTE for checkpointing the user applications (per Josh's work and Ralph's explanations), but there's no need to actually pause / checkpoint the orteds themselves.
As a corollary, this means that we likely will not be able to pause / checkpoint in cases where we don't use orteds. I'm fine with that. Currently, the only place where this occurs is on Red Storm, where pausing doesn't make sense (I'm not conversant enough with the Red Storm architecture to know if they care about checkpointing, and if so, how it's handled).

> -----Original Message-----
> From: devel-boun...@open-mpi.org
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Pak Lui
> Sent: Friday, June 02, 2006 11:37 AM
> To: r...@lanl.gov; Open MPI Developers
> Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted
>
> I agree that stopping orted may not be the behavior that we are
> looking for. Instead, we can send the signals to the application
> processes, since stopping them is what we are interested in.
>
> The idea is to stop the resource consumption by the user processes
> once the stop signal is sent from N1GE, since orted is being an
> administrative daemon rather than a running process that's doing
> work, it probably does not need to be accounted for the resource
> usage.
>
> And since 'qrsh' does not issue a 'stop' orted but only give a stop
> signal to mpirun, it's really up to mpirun to tell where to give the
> stop signal to.
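
To make the idea concrete, here is a rough sketch of what "signal the application processes, not the daemon" looks like at the POSIX level. This is not Open MPI code: the current process merely stands in for mpirun/orted, the children stand in for user application processes, and the names `forward_signal` and `run_demo` are made up for illustration.

```python
import os
import signal
import subprocess
import sys


def forward_signal(app_pids, signum):
    """Deliver signum to each application process; the caller is not signalled."""
    for pid in app_pids:
        os.kill(pid, signum)


def run_demo(num_apps=2):
    # Hypothetical stand-ins for the user application processes.
    apps = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        for _ in range(num_apps)
    ]
    pids = [p.pid for p in apps]
    try:
        # Stop only the applications, then confirm each one reports "stopped".
        forward_signal(pids, signal.SIGSTOP)
        stopped = all(
            os.WIFSTOPPED(os.waitpid(pid, os.WUNTRACED)[1]) for pid in pids
        )
        # Resume them, then confirm each one reports "continued".
        forward_signal(pids, signal.SIGCONT)
        continued = all(
            os.WIFCONTINUED(os.waitpid(pid, os.WCONTINUED)[1]) for pid in pids
        )
    finally:
        # Clean up the stand-in applications.
        for p in apps:
            p.kill()
            p.wait()
    return {"stopped": stopped, "continued": continued}


if __name__ == "__main__":
    print(run_demo())
```

The parent keeps running the whole time, which is the point: the process doing the administrative work is never stopped, only the processes consuming resources on behalf of the user. In the real system the parent cannot catch SIGSTOP itself, so how the stop request reaches mpirun is a separate question that this sketch does not address.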