Hi All,

I'm running some experiments with modifications to my heartbeat code, which uses the OOB-TCP communication channel.

My modified orteds and orterun do not abort all processes when one orted dies.

The problem is:

1) I kill an orted, and another orted detects the fault when it tries to send a heartbeat to the failed orted.

2) The RTE becomes stable again, but the orted that sent the heartbeat prints the following oob-tcp message: "[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication retries exceeded. Can not communicate with peer"

And my questions are:

a) Once an oob-tcp instance reaches mca_oob_tcp_peer_shutdown, it discards this peer, right?

b) The message is removed from the queue with the ORTE_ERR_UNREACH code, right?

c) Why does the orted keep printing this message after the retries are exceeded?
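
To make the behaviour I expect concrete, here is a minimal sketch in C. It is not the actual OOB-TCP code: every name in it (peer_t, send_heartbeat, MAX_RETRIES, ERR_UNREACH, always_down) is hypothetical, and the retry limit and error value are assumptions chosen only for illustration. The idea is that once the retries are exceeded the peer is marked as failed, the warning is printed once, and every later send to that peer returns the unreachable error immediately without retrying or printing again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* All names here are hypothetical; this is not the Open MPI OOB-TCP code. */
#define MAX_RETRIES 3        /* assumed retry limit */
#define ERR_UNREACH (-12)    /* arbitrary stand-in for ORTE_ERR_UNREACH */

typedef enum { PEER_OK, PEER_FAILED } peer_state_t;

typedef struct {
    int          vpid;      /* rank of the remote daemon */
    int          retries;   /* consecutive failed connection attempts */
    peer_state_t state;     /* PEER_FAILED once the peer is discarded */
    int          warnings;  /* number of times the failure was reported */
} peer_t;

/* Simulated transport: returns true if the connection attempt succeeds. */
typedef bool (*connect_fn)(const peer_t *peer);

/* A transport that always fails, standing in for a killed orted. */
static bool always_down(const peer_t *peer)
{
    (void)peer;
    return false;
}

/* Try to deliver one heartbeat; returns 0 on success, ERR_UNREACH otherwise. */
int send_heartbeat(peer_t *peer, connect_fn try_connect)
{
    assert(peer != NULL && try_connect != NULL);

    /* Peer already discarded: fail fast, no retry and no message. */
    if (peer->state == PEER_FAILED) {
        return ERR_UNREACH;
    }

    while (peer->retries < MAX_RETRIES) {
        if (try_connect(peer)) {
            peer->retries = 0;   /* connection is healthy again */
            return 0;
        }
        peer->retries++;
    }

    /* Retries exceeded: discard the peer and report the failure once. */
    peer->state = PEER_FAILED;
    peer->warnings++;
    fprintf(stderr, "peer %d: communication retries exceeded\n", peer->vpid);
    return ERR_UNREACH;
}
```

With this logic, calling send_heartbeat() twice on a dead peer with always_down prints the warning only on the first call. What I observe with the real orted is the opposite: the message keeps appearing.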

Thanks,
--

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
