On Wed, Jun 13, 2007 at 07:08:51PM +0300, Gleb Natapov wrote:
> On Wed, Jun 13, 2007 at 09:38:21AM -0600, Galen Shipman wrote:
> > Hi Gleb,
> >
> > As we have discussed before, I am working on adding support for
> > multiple QPs with either per-peer resources or shared resources.
> > As a result of this I am trying to clean up a lot of the OpenIB code.
> > It has grown up organically over the years and needs some attention.
> > Perhaps we can coordinate on commits or even work from the same temp
> > branch to do an overall cleanup as well as addressing the issue you
> > describe in this email.
> >
> > I bring this up because this commit will conflict quite a bit with
> > what I am working on. I can always merge it by hand, but it may make
> > sense for us to get this all done in one area and then bring it all
> > over?
>
> I am not committing this yet. I want people to review my logic and the
> patch. If the change is OK with everyone who cares then I want this
> change to go into the 1.2 branch.
>
> I don't care how this change will get to the trunk. I can use a patched
> version for a while. If your branch is in a working state right now I can
> merge this change into it tomorrow.
The patch applies to ib_multifrag as is, without a conflict. But the
branch doesn't compile with or without the patch, so I was not able to
test it. Do you have some uncommitted changes that may generate a
conflict? Can you commit them so they can be resolved? If there is no
conflict between your work and this patch, maybe it is a good idea to
commit it to your branch and trunk for testing?

>
> >
> > Thanks,
> >
> > Galen
> >
> >
> > On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
> >
> > > Hello everyone,
> > >
> > > I encountered a problem with the openib on-demand connection code.
> > > Basically it works only by pure luck if you have more than one
> > > endpoint for the same proc, and sometimes it breaks in mysterious
> > > ways.
> > >
> > > The algorithm works like this: A wants to connect to B, so it
> > > creates a QP and sends it to B. B receives the QP from A, looks for
> > > an endpoint that is not yet associated with a remote endpoint,
> > > creates a QP for it and sends the info back. Now A receives the QP
> > > and goes through the same logic as B, i.e. looks for an endpoint
> > > that is not yet connected, BUT there is no guarantee that it will
> > > find the endpoint that initiated the connection in the first place!
> > > And if it finds another one, it will create a QP for it and send it
> > > back to B, and so on and so forth. In the end I sometimes get a
> > > peculiar mesh of connections where no QP has a connection back to
> > > it from the peer process.
> > >
> > > To overcome this problem B needs to send back some info that will
> > > allow A to determine the endpoint that initiated the connection
> > > request. The lid:qp pair will allow for this. But even then the
> > > problem will remain if two procs initiate a connection at the same
> > > time. To deal with simultaneous connections an asymmetric protocol
> > > has to be used: one peer becomes the master, the other the slave.
> > > The slave always initiates a connection to the master. The master
> > > chooses a local endpoint to satisfy the incoming request and sends
> > > the info back to the slave. If the master wants to initiate a
> > > connection it sends a message to the slave, and the slave initiates
> > > a connection back to the master.
> > >
> > > The included patch implements the algorithm described above and
> > > works for all scenarios for which the current code fails to create
> > > a connection.
> > >
> > > --
> > > Gleb.
> > > <fix_openib_wireup.diff>
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> --
> Gleb.

--
Gleb.
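
For illustration only, the master/slave rule and the lid:qp matching described in the quoted message could be sketched roughly as below. This is not the code from fix_openib_wireup.diff. All type and function names (endpoint_t, pick_role, pick_free, find_initiator, mark_connected) are hypothetical, and the rule "lower process id becomes master" is an assumption, since the message does not say how the roles are chosen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical endpoint record; not the Open MPI openib BTL structure. */
typedef struct {
    uint16_t lid;         /* local port LID */
    uint32_t qp_num;      /* local QP number */
    bool     connected;   /* already paired with a remote endpoint? */
    uint16_t rem_lid;     /* remote LID once connected */
    uint32_t rem_qp_num;  /* remote QP number once connected */
} endpoint_t;

typedef enum { ROLE_MASTER, ROLE_SLAVE } role_t;

/* Assumption: roles are fixed by comparing unique process ids, so both
 * peers reach the same decision without exchanging extra messages. */
role_t pick_role(uint32_t my_id, uint32_t peer_id)
{
    assert(my_id != peer_id);
    return my_id < peer_id ? ROLE_MASTER : ROLE_SLAVE;
}

/* Master side: any endpoint that is not yet connected may serve an
 * incoming request from the slave. */
endpoint_t *pick_free(endpoint_t *eps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!eps[i].connected) {
            return &eps[i];
        }
    }
    return NULL;
}

/* Slave side: the master's reply echoes the lid:qp pair the slave sent,
 * so the slave finds exactly the endpoint that started the request
 * instead of guessing at "some unconnected endpoint". */
endpoint_t *find_initiator(endpoint_t *eps, size_t n,
                           uint16_t lid, uint32_t qp_num)
{
    for (size_t i = 0; i < n; i++) {
        if (!eps[i].connected &&
            eps[i].lid == lid && eps[i].qp_num == qp_num) {
            return &eps[i];
        }
    }
    return NULL;
}

/* Record the remote side and mark the endpoint as paired. */
void mark_connected(endpoint_t *ep, uint16_t rem_lid, uint32_t rem_qp_num)
{
    assert(ep != NULL);
    ep->rem_lid = rem_lid;
    ep->rem_qp_num = rem_qp_num;
    ep->connected = true;
}
```

In this sketch a master that wants a connection would not create a QP itself. It would send a request message to the slave, and the slave would then start the normal slave-to-master exchange, so only one side ever picks a free endpoint for a given request.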