Hi All,
I'm experimenting with and modifying my heartbeat code, which
uses the OOB-TCP communication channel.
My modified orteds and orterun do not abort all processes when one
orted dies.
The problem is:
1) I kill an orted, and another orted detects the fault when it tries to
send a heartbeat to the faulty orted.
2) The RTE becomes stable again, but the orted that sent the heartbeat
prints the following oob-tcp message:
"[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication
retries exceeded. Can not communicate with peer"
And my questions are:
a) Once mca_oob_tcp_peer_shutdown is invoked for a peer, the oob-tcp
instance discards that peer, correct?
b) The message is removed from the queue with the ORTE_ERR_UNREACH code,
correct?
c) Why, after the retries are exceeded, does the orted continue to print
this message?
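
To make the behavior I expect in (a) to (c) concrete, here is a toy
Python model. It is only a sketch of my assumptions, not the Open MPI
code: the Peer class, MAX_RETRIES, and the string standing in for
ORTE_ERR_UNREACH are all hypothetical names chosen for illustration.

```python
from collections import deque

# Placeholder label only; this is NOT the real numeric ORTE error code.
ERR_UNREACH = "ORTE_ERR_UNREACH"
# Hypothetical retry limit, chosen for illustration.
MAX_RETRIES = 3


class Peer:
    """Toy model of a peer connection (not the real oob-tcp peer struct)."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.retries = 0
        self.failed = False      # set once the peer has been shut down
        self.queue = deque()     # messages waiting to be sent
        self.log = []            # diagnostics that would be printed

    def shutdown(self):
        # Expected (a): the peer is discarded after the retries are exceeded.
        self.log.append(f"{self.name}: communication retries exceeded")
        self.failed = True
        # Expected (b): every queued message is completed with UNREACH.
        results = [(msg, ERR_UNREACH) for msg in self.queue]
        self.queue.clear()
        return results

    def send(self, msg):
        # Expected (c): a discarded peer fails fast, without retrying
        # and without printing the diagnostic again.
        if self.failed:
            return [(msg, ERR_UNREACH)]
        self.queue.append(msg)
        while not self.reachable:
            self.retries += 1
            if self.retries > MAX_RETRIES:
                return self.shutdown()
        delivered = [(m, "OK") for m in self.queue]
        self.queue.clear()
        return delivered


if __name__ == "__main__":
    dead = Peer("[[12518,0],2]", reachable=False)
    print(dead.send("heartbeat-1"))
    print(dead.send("heartbeat-2"))
    print(dead.log)
```

In this model the diagnostic is logged once, when the peer is shut down,
and later heartbeats to the same peer simply come back as unreachable.
What I observe is that the message keeps being printed.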
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478