Hello

As Jeff stated, the smr has been removed from the system. We did this because experience showed that monitoring process/node status is highly system-dependent and closely tied to the launch system. Thus, it made no sense to keep those two functions separate.
For example, we have successfully prototyped the detection of orted/node failure on TM, based on notification from Torque when the orted fails. A similar approach appears to be working under SLURM (one glitch remains to be ironed out).

I would think that a heartbeat protocol would primarily be applicable in the RSH environment. We certainly wouldn't want to do it in TM or SLURM, and I suspect that most of the other managed environments have similar detection mechanisms.

If you think there are other environments that would also need a heartbeat, then you could put it in the PLM base and people can call it if they want to use it. My only caveat is that it increases our binary size, since base functions are always compiled, so we would only want to do that if we really thought multiple environments would use it. If it is only RSH, then it would probably be better placed in the RSH PLM module.

Hope that helps
Ralph

On 3/10/08 9:16 AM, "Leonardo Fialho" <lfia...@aomail.uab.es> wrote:

> Hi Jeff,
>
> I need to implement a heartbeat/watchdog monitoring system. I'm looking
> for the "best place" to put it, and I don't want to duplicate code.
> I'll try putting it into PLM for now, and once I get Ralph's response
> I'll change it if necessary.
>
> Jeff Squyres wrote:
>> Yes, it all got consolidated down into plm. We need to update the
>> FAQ; the ORTE frameworks changed quite a bit in the recent ORTE merge...
>>
>> Ralph's on vacation this week. A detailed answer to your question may
>> not occur until he returns...
>>
>> On Mar 10, 2008, at 10:05 AM, Leonardo Fialho wrote:
>>
>>> Hi all,
>>>
>>> Where is the "old" orte\mca\smr? I can't find it in orte/mca/plm...
>>>
>>> --
>>> Leonardo Fialho
>>> Computer Architecture and Operating Systems Department - CAOS
>>> Universidad Autonoma de Barcelona - UAB
>>> ETSE, Edifcio Q, QC/3088
>>> http://www.caos.uab.es
>>> Phone: +34-93-581-2888
>>> Fax: +34-93-581-2478
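
For anyone following the thread, here is a minimal sketch of the bookkeeping that a heartbeat monitor of the kind discussed above might need on the launching side. It is only an illustration under stated assumptions: every name in it (hb_monitor_t, hb_init, hb_beat, hb_check, HB_MAX_DAEMONS) is hypothetical and is not an existing ORTE or PLM symbol, the fixed-size table stands in for whatever daemon bookkeeping the real code would use, and time is passed in explicitly rather than read from a clock so the logic can be checked in isolation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; none are ORTE/PLM symbols. */
#define HB_MAX_DAEMONS 64

typedef struct {
    double last_beat[HB_MAX_DAEMONS]; /* time of most recent heartbeat (s) */
    bool   tracked[HB_MAX_DAEMONS];   /* true once a daemon has reported  */
    double timeout;                   /* silence longer than this = failed */
} hb_monitor_t;

/* Reset the monitor; no daemon is tracked until its first heartbeat. */
void hb_init(hb_monitor_t *m, double timeout)
{
    assert(m != NULL && timeout > 0.0);
    for (size_t i = 0; i < HB_MAX_DAEMONS; i++) {
        m->last_beat[i] = 0.0;
        m->tracked[i] = false;
    }
    m->timeout = timeout;
}

/* Record a heartbeat from daemon `id` at time `now`.
 * Returns false if `id` is out of range. */
bool hb_beat(hb_monitor_t *m, size_t id, double now)
{
    if (id >= HB_MAX_DAEMONS) {
        return false;
    }
    m->last_beat[id] = now;
    m->tracked[id] = true;
    return true;
}

/* Collect daemons that have been silent for longer than the timeout.
 * Writes up to `max_failed` ids into `failed` and returns how many were
 * written. Reported daemons are dropped from tracking, so each failure
 * is reported only once. */
size_t hb_check(hb_monitor_t *m, double now, size_t *failed, size_t max_failed)
{
    size_t n = 0;
    for (size_t i = 0; i < HB_MAX_DAEMONS && n < max_failed; i++) {
        if (m->tracked[i] && now - m->last_beat[i] > m->timeout) {
            failed[n++] = i;
            m->tracked[i] = false;
        }
    }
    return n;
}
```

In this sketch the launcher would call hb_beat whenever a message arrives from a daemon and call hb_check from a periodic timer. Whether such code belongs in the PLM base or only in the RSH module is the design question raised above, and the sketch does not settle it.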