I'm not an expert in C or Open MPI, but I'm a volunteer.

Leonardo

Ralph Castain wrote:
Sorry for the delayed response - had some things to finish, then had to stare at this code for a while.

Unfortunately, the OOB is a snarled can of hideous worms. It looks to me that the OOB continues to attempt to complete any pending message requests once it detects that retries have exceeded the limit. In doing so, it looks like it triggers pending events, which would include pending sends - thus causing it to again emit that error message.
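To make the suspected sequence concrete, here is a toy C model. It is only a sketch of the hypothesis above, not the OOB code: every name in it (toy_peer, toy_send, toy_progress_pending, MAX_RETRIES) is made up for illustration, and the real retry limit and data structures are not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_RETRIES 2   /* hypothetical retry limit */

/* Toy stand-in for a peer; none of these names come from Open MPI. */
struct toy_peer {
    int retries;        /* send attempts made so far */
    int pending;        /* number of queued sends */
    int errors_printed; /* how many times the error was emitted */
};

/* Emit the error line and remember that we did. */
static void toy_report(struct toy_peer *p)
{
    p->errors_printed++;
    printf("toy-oob: communication retries exceeded\n");
}

/* Attempt one send; every attempt against a dead peer counts as a retry. */
static void toy_send(struct toy_peer *p)
{
    if (++p->retries > MAX_RETRIES) {
        toy_report(p);
    }
}

/* Suspected behavior: after the limit is hit, each pending request is
 * still progressed, and each one trips the same error path again. */
static int toy_progress_pending(struct toy_peer *p)
{
    assert(p != NULL);
    while (p->pending > 0) {
        p->pending--;
        toy_send(p);
    }
    return p->errors_printed;
}

/* Peer already at the retry limit, with three sends still queued. */
int toy_demo(void)
{
    struct toy_peer p = { MAX_RETRIES, 3, 0 };
    return toy_progress_pending(&p);
}
```

In this model the error line is emitted once per queued send, so three pending sends give three copies of the message. Whether the real code follows this path is exactly the part I can't swear to.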

I can't swear to any of this, of course - the worms are really deep and tangled down there.

A rewrite of the OOB is planned for next year - hopefully, the last of the spaghetti to be unraveled. Not sure if that will really happen, though, as I think everyone is afraid of that black hole of despair. If it does, this is one thing we can try to address.

Any volunteers??

Ralph


On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote:

Hi All,

I'm doing some experiments and modifications in my heartbeat code, which uses the OOB-TCP communication channel.

My modified orteds and orterun do not abort all processes when one orted dies.

The problem is:

1) I kill an orted, and another orted detects the fault when it tries to send a heartbeat to the faulty orted.

2) The RTE becomes stable again, but the orted that sent the heartbeat prints the following oob-tcp message: "[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication retries exceeded. Can not communicate with peer"

And the questions are:

a) Once an oob-tcp instance gets the mca_oob_tcp_peer_shutdown, it discards this peer, right?

b) The message is removed from the queue with the ORTE_ERR_UNREACH code, right?

c) Why does the orted continue to print this message after the retries are exceeded?
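To show what I expect in (a) and (b), here is a small C sketch. It is not the Open MPI implementation: sketch_peer, sketch_msg, sketch_peer_shutdown and sketch_send are hypothetical names, and SKETCH_ERR_UNREACH is only a stand-in for ORTE_ERR_UNREACH, not its real value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_ERR_UNREACH (-1) /* stand-in; not the real ORTE_ERR_UNREACH value */
#define SKETCH_QUEUE_LEN 8      /* hypothetical queue capacity */

struct sketch_msg {
    int status;                 /* 0 = still pending */
};

struct sketch_peer {
    int failed;                 /* set once, on shutdown */
    int log_count;              /* error lines emitted */
    size_t queued;
    struct sketch_msg *queue[SKETCH_QUEUE_LEN];
};

/* Shut the peer down: log once, then fail every queued message. */
static void sketch_peer_shutdown(struct sketch_peer *p)
{
    assert(p != NULL);
    if (p->failed) {
        return;                 /* already shut down; stay quiet */
    }
    p->failed = 1;
    p->log_count++;
    fprintf(stderr, "sketch-oob: retries exceeded, dropping peer\n");
    for (size_t i = 0; i < p->queued; i++) {
        p->queue[i]->status = SKETCH_ERR_UNREACH;
    }
    p->queued = 0;
}

/* Sends to a failed peer return the error without logging again. */
static int sketch_send(struct sketch_peer *p, struct sketch_msg *m)
{
    assert(p != NULL && m != NULL);
    if (p->failed || p->queued == SKETCH_QUEUE_LEN) {
        m->status = SKETCH_ERR_UNREACH;
        return SKETCH_ERR_UNREACH;
    }
    p->queue[p->queued++] = m;
    return 0;
}

/* Queue two sends, shut down twice, send once more; count log lines. */
int sketch_shutdown_demo(void)
{
    struct sketch_peer p = {0};
    struct sketch_msg a = {0}, b = {0}, late = {0};
    sketch_send(&p, &a);
    sketch_send(&p, &b);
    sketch_peer_shutdown(&p);
    sketch_peer_shutdown(&p);   /* second call is a no-op */
    sketch_send(&p, &late);     /* fails quietly */
    return p.log_count;
}
```

With this behavior the error is logged once per failed peer, however many messages were queued or arrive later. That is what I expected to see, which is why the repeated message in (c) surprises me.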

Thanks,
--

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



