Hello Ralph.

> Are you talking about an MPI communication? If so, then you need to update every proc's modex info for the proc that moved - this is something stored in each MPI proc's memory, so it isn't something that you can just get from the daemon on-demand. You'll have to provide the update to every single proc directly so that it has the info if/when it should decide to send an MPI message to the proc that moved.

Yes, about MPI communications.

> See the modex database interface in orte/mca/grpcomm/base/grpcomm_base_modex.c. You'll have to create new code to send/recv an update message, but the code to update the database entry exists.

By a send/recv update message, I think you mean something similar to packing/unpacking the info, maybe also using the allgather as is done in grpcomm_base_modex.c.

I took a look at the code and found the orte_grpcomm_base_update_modex_entries(&proc_name, &rbuf) function. I printed the attr_name and got btl.tcp.1.7 and other attributes, but I can't find any information about the URI, address, or anything else that would allow me to communicate with another peer.

I think I have to update the endpoint somewhere, but I don't know where I can do this from, or whether there is a function that allows that kind of update.

Thanks again.

Hugo

2011/6/3 Ralph Castain <r...@open-mpi.org>

> Are you talking about an MPI communication? If so, then you need to update
> every proc's modex info for the proc that moved - this is something stored
> in each MPI proc's memory, so it isn't something that you can just get from
> the daemon on-demand. You'll have to provide the update to every single proc
> directly so that it has the info if/when it should decide to send an MPI
> message to the proc that moved.
>
> This is why we do a modex upon restart - sending the change to every MPI
> proc is hardly scalable minus a collective operation.
>
> See the modex database interface in
> orte/mca/grpcomm/base/grpcomm_base_modex.c. You'll have to create new code
> to send/recv an update message, but the code to update the database entry
> exists.
>
>
> On Jun 2, 2011, at 7:52 AM, Hugo Meyer wrote:
>
> Hello again.
>
> My actual problem is that i don't know where is the struct that has the
> information that is used to send messages to the procs.
>
> Something like:
>
> Rank    URI
> 0       21222:tcp:192.168.1.1:1250
> 1       21223:tcp:192.168.1.2:1250
> .....   .....
>
>
> Because what i need is to update it when i move a process from its original
> site, is there something like this??
>
> Thanks a lot.
>
> Hugo
>
> 2011/5/31 Hugo Meyer <meyer.h...@gmail.com>
>
>> Hello @ll.
>>
>> I'm needing some help to restart the communication with a process that i
>> restore in a different node. My situation is as follows:
>>
>> The process fails and it's restored in another node succesfully from a
>> previous checkpoint that i sent there. Now, when a process try to send a
>> message to this restored process it will fail, or at least, it will be
>> locked in ompi_request_wait_completion.
>>
>> So, when this happens i have to send a message to the daemon of the sender
>> that will have the uri of where the process has been restored and answer to
>> the proc with this and it will update this info.
>>
>> So, i need to know where in the code i can capture this attempt to send
>> and then send the message to his daemon and where and how i can update this
>> info to send the message to the right place (Same rank but new uri).
>>
>> I have to do it in this way to avoid a collective communication.
>>
>> If you give me a hand on this, it will be great.
>>
>> Best regards.
>>
>> Hugo
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
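
For illustration only, here is a minimal sketch in C of the kind of per-process rank-to-URI table described in the thread, with an update routine for a rank that has been restored on another node. Everything in it is hypothetical: peer_table_t, peer_table_init, peer_table_update, peer_table_lookup and the size limits are invented for this sketch and are not Open MPI structures or functions. In Open MPI the corresponding data would presumably live in the modex database and the BTL endpoints, which this sketch does not touch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limits, chosen only to keep the sketch simple. */
#define MAX_PEERS   64
#define MAX_URI_LEN 128

/* Hypothetical per-process table: one contact URI per MPI rank.
 * This is NOT an Open MPI data structure. */
typedef struct {
    char uri[MAX_PEERS][MAX_URI_LEN];
    int  num_peers;
} peer_table_t;

/* Clear the table and record how many ranks it covers. */
void peer_table_init(peer_table_t *t, int num_peers)
{
    assert(num_peers >= 0 && num_peers <= MAX_PEERS);
    memset(t, 0, sizeof(*t));
    t->num_peers = num_peers;
}

/* Overwrite the URI stored for 'rank' (same rank, new location).
 * Returns 0 on success, -1 if the rank is out of range or the
 * URI does not fit in the table entry. */
int peer_table_update(peer_table_t *t, int rank, const char *new_uri)
{
    if (rank < 0 || rank >= t->num_peers) {
        return -1;
    }
    if (strlen(new_uri) >= MAX_URI_LEN) {
        return -1;
    }
    strcpy(t->uri[rank], new_uri);
    return 0;
}

/* Return the URI currently stored for 'rank', or NULL if the rank
 * is out of range. */
const char *peer_table_lookup(const peer_table_t *t, int rank)
{
    if (rank < 0 || rank >= t->num_peers) {
        return NULL;
    }
    return t->uri[rank];
}
```

A sender that receives an update message for rank 1 would call something like peer_table_update(&table, 1, new_uri) before retrying the send. As noted in the thread, every process holds its own copy of this information, so the update has to reach each process that may later send to the moved rank.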