I'm not an expert in C or in Open MPI, but I'm a volunteer.
Leonardo
Ralph Castain wrote:
Sorry for the delayed response - had some things to finish, then had to
stare at this code for a while.
Unfortunately, the OOB is a snarled can of hideous worms. It looks to
me like the OOB continues to attempt to complete any pending message
requests after it detects that retries have exceeded the limit. In
doing so, it appears to trigger pending events, which would include
pending sends - thus causing it to emit that error message again.
I can't swear to any of this, of course - the worms are really deep
and tangled down there.
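
To make the suspected behavior concrete, here is a minimal sketch of the
kind of guard that would stop the repeated message. It is not the actual
OOB code: every name in it (the sketch_* types and functions,
SKETCH_MAX_RETRIES, the value of SKETCH_ERR_UNREACH) is hypothetical, and
it only assumes a peer with a list of pending sends. The idea is to mark
the peer as failed before the pending sends are completed with an
"unreachable" status, so nothing that fires during the flush can retry
the connection and report the error a second time.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins; not the real ORTE values or types. */
#define SKETCH_ERR_UNREACH (-12)
#define SKETCH_MAX_RETRIES 3

typedef void (*sketch_cb_t)(int status, void *cbdata);

typedef struct sketch_msg {
    struct sketch_msg *next;
    sketch_cb_t cb;      /* completion callback, may be NULL */
    void *cbdata;
} sketch_msg_t;

typedef enum { SKETCH_PEER_OK, SKETCH_PEER_FAILED } sketch_state_t;

typedef struct {
    sketch_state_t state;
    int retries;
    int errors_reported;   /* how many times the error was printed */
    sketch_msg_t *pending; /* sends waiting for a connection */
} sketch_peer_t;

/* Queue a send; refuse immediately if the peer already failed. */
int sketch_peer_queue(sketch_peer_t *peer, sketch_msg_t *msg)
{
    assert(peer != NULL && msg != NULL);
    if (peer->state == SKETCH_PEER_FAILED) {
        return SKETCH_ERR_UNREACH;
    }
    msg->next = peer->pending;
    peer->pending = msg;
    return 0;
}

static void sketch_peer_fail(sketch_peer_t *peer)
{
    /* Mark the peer failed first, so a callback that tries to send
     * again is rejected instead of starting a new retry cycle. */
    peer->state = SKETCH_PEER_FAILED;
    peer->errors_reported++;
    fprintf(stderr, "sketch: communication retries exceeded\n");

    /* Detach the whole list before walking it. */
    sketch_msg_t *msg = peer->pending;
    peer->pending = NULL;
    while (msg != NULL) {
        sketch_msg_t *next = msg->next; /* callback may release msg */
        if (msg->cb != NULL) {
            msg->cb(SKETCH_ERR_UNREACH, msg->cbdata);
        }
        msg = next;
    }
}

/* Called once for every failed connection attempt. */
void sketch_peer_connect_failed(sketch_peer_t *peer)
{
    assert(peer != NULL);
    if (peer->state == SKETCH_PEER_FAILED) {
        return; /* already reported; stay quiet */
    }
    if (++peer->retries >= SKETCH_MAX_RETRIES) {
        sketch_peer_fail(peer);
    }
}

/* Example callback: counts completions that carried the unreachable status. */
void sketch_count_unreach(int status, void *cbdata)
{
    if (status == SKETCH_ERR_UNREACH) {
        (*(int *)cbdata)++;
    }
}
```

With this shape, further calls to sketch_peer_connect_failed() after the
limit is reached are no-ops, each pending send is completed exactly once,
and the error is printed once. Whether the real OOB can be fixed this
simply is a separate question.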
A rewrite of the OOB is planned for next year - hopefully, the last of
the spaghetti to be unraveled. Not sure if that will really happen,
though, as I think everyone is afraid of that black hole of despair.
If it does, this is one thing we can try to address.
Any volunteers??
Ralph
On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote:
Hi All,
I'm doing some experiments and modifications in my heartbeat code,
which uses the OOB-TCP communication channel.
My modified orteds and orterun do not abort all processes when one
orted dies.
The problem is:
1) I kill an orted, so another orted detects the fault when it tries to
send a heartbeat to the faulty orted.
2) The RTE becomes stable again, but the orted that sent the
heartbeat prints the following oob-tcp message:
"[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication
retries exceeded. Can not communicate with peer"
And the questions are:
a) Once an oob-tcp instance gets the mca_oob_tcp_peer_shutdown, it
discards this peer, right?
b) The message is removed from the queue with the ORTE_ERR_UNREACH code, right?
c) Why, after the retries are exceeded, does the orted continue to print this message?
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edificio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel