On Jun 4, 2011, at 5:21 AM, Hugo Meyer wrote:

> Thanks for your replies.
> 
> >After doing that, the MPI_Init procedure calls grpcomm.modex to distribute 
> >the data across all procs in the job. Unfortunately, being a collective, all 
> >procs must participate. In your case, you'll have to find a different way to 
> >do it. Upon receipt, each proc updates its own modex db to include the new 
> >info.
> 
> >Look in orte/mca/grpcomm/bad/grpcomm_bad_module.c at the modex function and 
> >follow that code thru the grpcomm/base functions to see how the modex info 
> >is retrieved, passed, and decoded on the far end.
> 
> I will take a look at this, Ralph, and let you know how it goes. Today, 
> while I was looking at the code with a colleague, he suggested that I try 
> to capture an error when sending data through the btl_tcp_endpoint, more 
> precisely in mca_btl_tcp_frag_send, at the point where we write to the 
> socket's fd. I've tried this, but when a process moves and tries to send a 
> message, or someone tries to send a message to it, I cannot capture the 
> moment of the failure in mca_btl_tcp_frag_send, and I don't know why. It is 
> supposed to fail when someone tries to send. Is there any other place where 
> this is captured? If I do it this way, I suppose I can reset connections on 
> demand. What do you think of this? Is it a good idea? And after I detect 
> this failure, I would try to update the modex db of that process from 
> there. Is that OK?

I'm no expert on the tcp btl - perhaps George can answer?

The run-time has no visibility into MPI connections, and has no understanding 
of the modex contents. So if a proc detects that it cannot make the btl 
connection, I guess it could send an orte message to the proc it's trying to 
reach, and have that proc return a copy of its modex data?
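
To make the "return a copy of its modex data" idea concrete, here is a minimal sketch of the receiving side only: a per-process table that stores one opaque blob per peer and replaces it when a fresher copy arrives. Every name in it (modex_cache_t, modex_cache_update, modex_cache_lookup, MAX_PEERS) is hypothetical and not part of Open MPI; the real modex db, its keys, and the ORTE messaging calls are not shown, and the blob is deliberately never decoded here.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-process modex db: one opaque blob per peer.
 * None of these names come from Open MPI. */
#define MAX_PEERS 64

typedef struct {
    int            in_use;
    int            peer_rank;  /* which peer the blob describes */
    size_t         len;        /* blob length in bytes */
    unsigned char *blob;       /* opaque bytes; only the owning BTL decodes them */
} modex_entry_t;

typedef struct {
    modex_entry_t entries[MAX_PEERS];
} modex_cache_t;

static void modex_cache_init(modex_cache_t *c)
{
    memset(c, 0, sizeof(*c));
}

/* Store or replace the blob for peer_rank. Returns 0 on success, -1 on failure. */
static int modex_cache_update(modex_cache_t *c, int peer_rank,
                              const void *blob, size_t len)
{
    modex_entry_t *slot = NULL;
    for (int i = 0; i < MAX_PEERS; i++) {
        if (c->entries[i].in_use && c->entries[i].peer_rank == peer_rank) {
            slot = &c->entries[i];          /* existing entry wins */
            break;
        }
        if (!c->entries[i].in_use && slot == NULL) {
            slot = &c->entries[i];          /* remember first free slot */
        }
    }
    if (slot == NULL) {
        return -1;                          /* table full */
    }
    unsigned char *copy = malloc(len > 0 ? len : 1);
    if (copy == NULL) {
        return -1;
    }
    if (len > 0) {
        memcpy(copy, blob, len);
    }
    if (slot->in_use) {
        free(slot->blob);                   /* drop the stale contact info */
    }
    slot->in_use = 1;
    slot->peer_rank = peer_rank;
    slot->len = len;
    slot->blob = copy;
    return 0;
}

/* Return the entry for peer_rank, or NULL if nothing is cached for it. */
static const modex_entry_t *modex_cache_lookup(const modex_cache_t *c,
                                               int peer_rank)
{
    for (int i = 0; i < MAX_PEERS; i++) {
        if (c->entries[i].in_use && c->entries[i].peer_rank == peer_rank) {
            return &c->entries[i];
        }
    }
    return NULL;
}
```

In this sketch the proc that failed to connect would call modex_cache_update() with whatever bytes the peer sent back, then hand the refreshed blob to the same BTL type that produced it.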

I guess that could work. You may be running into the MPI layer's own attempts 
to ensure comm success via retry. I know you won't get a send failure just 
because the socket is closed; it'll keep retrying the connection for a while 
before giving up.
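
For reference, here is a minimal sketch of catching a failed write on a socket fd and sorting the errno into "retry" versus "peer is gone". It is plain POSIX, not Open MPI code: try_send, classify_send_errno, send_status_t, and the demo function are hypothetical names, and nothing here reflects how mca_btl_tcp_frag_send is actually written. The demo uses a local AF_UNIX socketpair, where a write after the peer closes fails immediately. On a real TCP connection the first write after the peer disappears can still succeed, because the kernel only queues the data, so an error may surface only on a later write. That may be one reason the failure is hard to observe at the point of the send.

```c
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

typedef enum { SEND_OK, SEND_RETRY, SEND_PEER_GONE, SEND_ERROR } send_status_t;

/* Map the errno left by a failed write to a coarse status. */
static send_status_t classify_send_errno(int err)
{
    switch (err) {
    case EINTR:
    case EAGAIN:
#if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:
#endif
        return SEND_RETRY;       /* transient: try again later */
    case EPIPE:
    case ECONNRESET:
    case ENOTCONN:
        return SEND_PEER_GONE;   /* candidate trigger for a reconnect */
    default:
        return SEND_ERROR;
    }
}

/* Attempt one write; *sent receives the byte count on success, 0 otherwise. */
static send_status_t try_send(int fd, const void *buf, size_t len, size_t *sent)
{
    ssize_t n = write(fd, buf, len);
    if (n >= 0) {
        *sent = (size_t)n;
        return SEND_OK;
    }
    *sent = 0;
    return classify_send_errno(errno);
}

/* Demo: create a connected AF_UNIX pair, close one end, write to the other. */
static send_status_t demo_write_after_peer_close(void)
{
    int sv[2];
    size_t sent;

    signal(SIGPIPE, SIG_IGN);    /* get EPIPE back instead of being killed */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        return SEND_ERROR;
    }
    close(sv[1]);                /* the "peer" goes away */
    send_status_t st = try_send(sv[0], "ping", 4, &sent);
    close(sv[0]);
    return st;
}
```

A caller that sees SEND_PEER_GONE could treat it as the cue to refresh the peer's contact info and reconnect, while SEND_RETRY would simply be retried.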


> 
> Thanks
> 
> Hugo
> 
> 
> 
> 2011/6/3 Jeff Squyres <jsquy...@cisco.com>
> On Jun 3, 2011, at 10:12 AM, Ralph Castain wrote:
> 
> > When an MPI proc calls MPI_Init, each btl pushes its contact info into the 
> > modex database - one example is the btl.tcp.1.7 info you found there. That 
> > entry is for the TCP btl, which is probably what you are looking for. There 
> > is no way for you to edit that data - each btl encodes it in its own way 
> > and then adds it to the modex.
> 
> More specifically, whatever each entity puts into the modex is a blob that is 
> only readable by other entities just like itself.  For example, what one TCP 
> BTL puts in the modex can really only be read by another TCP BTL. The 
> contents of what the TCP BTL puts in there is an opaque binary blob from the 
> modex's point of view.
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
