On Jun 4, 2011, at 5:21 AM, Hugo Meyer wrote:

> Thanks for your replies.
>
> >After doing that, the MPI_Init procedure calls grpcomm.modex to distribute
> >the data across all procs in the job. Unfortunately, being a collective, all
> >procs must participate. In your case, you'll have to find a different way to
> >do it. Upon receipt, each proc updates its own modex db to include the new
> >info.
>
> >Look in orte/mca/grpcomm/bad/grpcomm_bad_module.c at the modex function and
> >follow that code thru the grpcomm/base functions to see how the modex info
> >is retrieved, passed, and decoded on the far end.
>
> I will take a look at this, Ralph, and let you know how it goes. But today,
> looking at the code with a partner, he suggested that I try to capture an
> error when sending data through the btl_tcp_endpoint, more precisely in
> mca_btl_tcp_frag_send, and capture there an error when we try to write to the
> fd of the socket. I've tried this, but when a process moves and tries to send
> a message, or someone tries to send a message to it, I cannot capture the
> moment of the failure in mca_btl_tcp_frag_send, and I don't know why. It
> is supposed to fail when someone tries to send; is there any other place where
> this is captured? If I do it this way, I can reset connections on demand, I
> suppose. What do you think of this? Is it a good idea? And after I detect this
> failure, I will try to update the modex db of that process from here. Is that
> OK?
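
On capturing the failure at the write itself: the following is a minimal sketch in plain POSIX C, outside Open MPI, of classifying the result of a write on a socket fd. The names try_send, send_status_t and demo_peer_closed are hypothetical and are not part of the TCP BTL. The demo uses an AF_UNIX socketpair only so that the peer-closed case fails deterministically. On a real TCP connection, the first write after the peer disappears can still succeed, with the error surfacing only on a later call, which may be one reason the failure is not visible at the point of the send.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical classification of a send attempt; not an Open MPI type. */
typedef enum { SEND_OK, SEND_RETRY, SEND_PEER_GONE, SEND_ERROR } send_status_t;

/* Attempt one write on fd and map errno onto a coarse status. */
send_status_t try_send(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    ssize_t n = write(fd, buf, len);
    if (n >= 0)
        return SEND_OK;            /* may be a partial write; caller continues */
    if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK)
        return SEND_RETRY;         /* transient: try again later */
    if (errno == EPIPE || errno == ECONNRESET)
        return SEND_PEER_GONE;     /* peer closed or reset the connection */
    return SEND_ERROR;             /* anything else, e.g. EBADF */
}

/* Returns 0 if a write succeeds while the peer is open and is reported
 * as SEND_PEER_GONE after the peer end has been closed; 1 if the
 * statuses differ from that; -1 if the socketpair cannot be created. */
int demo_peer_closed(void)
{
    int sv[2];
    send_status_t before, after;

    /* Without this, writing to a closed peer raises SIGPIPE and
     * terminates the process instead of returning EPIPE. */
    signal(SIGPIPE, SIG_IGN);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    before = try_send(sv[0], "ping", 4);   /* peer still open */
    close(sv[1]);                          /* peer goes away */
    after = try_send(sv[0], "ping", 4);    /* write now fails with EPIPE */
    close(sv[0]);

    return (before == SEND_OK && after == SEND_PEER_GONE) ? 0 : 1;
}
```

Calling demo_peer_closed() from a small main() is expected to return 0. A SEND_PEER_GONE result would be the natural place to trigger a reconnect or a refresh of the peer's contact info, but where that hook belongs inside the TCP BTL is not something this sketch settles.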

I'm no expert on the tcp btl - perhaps George can answer?

The run-time has no visibility into MPI connections, and has no
understanding of the modex contents. So if a proc detects that it cannot
make the btl connection, I guess it could send an orte message to the proc
it's trying to reach, and have that proc return a copy of its modex data? I
guess that could work.

You may be running into the MPI layer's own attempts to ensure comm success
via retry... I know you won't get a send failure just because the socket is
closed - it'll keep retrying the connection for a while before giving up.

> Thanks
>
> Hugo
>
> 2011/6/3 Jeff Squyres <jsquy...@cisco.com>
> On Jun 3, 2011, at 10:12 AM, Ralph Castain wrote:
>
> > When an MPI proc calls MPI_Init, each btl pushes its contact info into the
> > modex database - one example is the btl.tcp.1.7 info you found there. That
> > entry is for the TCP btl, which is probably what you are looking for. There
> > is no way for you to edit that data - each btl encodes it in its own way
> > and then adds it to the modex.
>
> More specifically, whatever each entity puts into the modex is a blob that is
> only readable by other entities just like itself. For example, what one TCP
> BTL puts in the modex can really only be read by another TCP BTL. The
> contents of what the TCP BTL puts in there is an opaque binary blob from the
> modex's point of view.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
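
To illustrate the opaque-blob point in the quoted text: a minimal sketch, assuming a toy key/blob entry in place of the real modex. Every type and function here (modex_entry_t, tcp_contact_t, tcp_pack, tcp_unpack, modex_put, modex_roundtrip_demo) is hypothetical, and the 6-byte encoding is invented for illustration; the actual TCP BTL encoding and the grpcomm modex API are different. Only the key string btl.tcp.1.7 comes from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one modex entry: a key plus bytes that the
 * store never interprets. */
typedef struct {
    char    key[32];
    uint8_t blob[64];
    size_t  len;
} modex_entry_t;

/* Hypothetical contact info; the real TCP BTL layout is different. */
typedef struct {
    uint8_t  ipv4[4];
    uint16_t port;
} tcp_contact_t;

/* Encoding known only to the "TCP" side: 4 address bytes, then the
 * port in big-endian order. Returns the number of bytes written. */
size_t tcp_pack(const tcp_contact_t *c, uint8_t *out)
{
    memcpy(out, c->ipv4, 4);
    out[4] = (uint8_t)(c->port >> 8);
    out[5] = (uint8_t)(c->port & 0xff);
    return 6;
}

/* Decode a blob produced by tcp_pack; returns -1 on a length mismatch. */
int tcp_unpack(const uint8_t *in, size_t len, tcp_contact_t *c)
{
    if (len != 6)
        return -1;
    memcpy(c->ipv4, in, 4);
    c->port = (uint16_t)((in[4] << 8) | in[5]);
    return 0;
}

/* The store copies the bytes as-is; storing again overwrites the entry. */
int modex_put(modex_entry_t *e, const char *key, const uint8_t *blob, size_t len)
{
    assert(e != NULL && key != NULL && blob != NULL);
    if (len > sizeof e->blob || strlen(key) >= sizeof e->key)
        return -1;
    strcpy(e->key, key);
    memcpy(e->blob, blob, len);
    e->len = len;
    return 0;
}

/* Publish contact info, replace it after a (pretend) migration, and
 * check that the reader decodes the updated values. Returns 0 on success. */
int modex_roundtrip_demo(void)
{
    modex_entry_t entry;
    uint8_t wire[16];
    tcp_contact_t before = { {10, 0, 0, 5}, 5000 };
    tcp_contact_t after  = { {10, 0, 0, 9}, 6000 };
    tcp_contact_t seen;
    size_t n;

    n = tcp_pack(&before, wire);
    if (modex_put(&entry, "btl.tcp.1.7", wire, n) != 0)
        return 1;
    n = tcp_pack(&after, wire);
    if (modex_put(&entry, "btl.tcp.1.7", wire, n) != 0)
        return 1;
    if (tcp_unpack(entry.blob, entry.len, &seen) != 0)
        return 1;
    return (seen.port == 6000 && memcmp(seen.ipv4, after.ipv4, 4) == 0) ? 0 : 1;
}
```

The design point is that modex_put never looks inside the bytes, so an update after migration amounts to the owning component re-encoding its info and replacing the blob, with only a matching component on the other side able to decode it.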