On Wed, Jun 13, 2007 at 07:08:51PM +0300, Gleb Natapov wrote:
> On Wed, Jun 13, 2007 at 09:38:21AM -0600, Galen Shipman wrote:
> > Hi Gleb,
> > 
> > As we have discussed before I am working on adding support for  
> > multiple QPs with either per peer resources or shared resources.
> > As a result of this I am trying to clean up a lot of the OpenIB code.  
> > It has grown up organically over the years and needs some attention.
> > Perhaps we can coordinate on commits or even work from the same temp  
> > branch to do an overall cleanup as well as addressing the issue you  
> > describe in this email.
> > 
> > I bring this up because this commit will conflict quite a bit with
> > what I am working on. I can always merge it by hand, but it may make
> > sense for us to get this all done in one area and then bring it all
> > over?
> 
> I am not committing this yet. I want people to review my logic and the
> patch. If the change is OK with everyone who cares, then I want this
> change to go into the 1.2 branch.
> 
> I don't care how this change gets to the trunk. I can use a patched
> version for a while. If your branch is in a working state right now, I can
> merge this change into it tomorrow.

The patch applies to ib_multifrag as is, without a conflict. But the branch
doesn't compile with or without the patch, so I was not able to test it.
Do you have some uncommitted changes that may generate a conflict? If so,
can you commit them so the conflicts can be resolved? If there is no conflict
between your work and this patch, maybe it is a good idea to commit it to
your branch and to the trunk for testing?
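
For reviewers, the wire-up scheme described in the original mail quoted
below can be sketched roughly as follows. This is only an illustration in
Python, not the code from the patch: the names Endpoint, role, first_step,
handle_request and handle_reply, the dict used as a message, and the rule
that the lower rank acts as master are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

QpId = Tuple[int, int]  # (lid, qp number), the pair proposed for matching replies


@dataclass
class Endpoint:
    local: QpId                    # lid:qp of the QP created for this endpoint
    remote: Optional[QpId] = None  # lid:qp of the peer QP once connected


def role(my_rank: int, peer_rank: int) -> str:
    # Assumption: the lower rank acts as master. The mail only says that one
    # peer becomes master and the other slave, not how they are chosen.
    return "master" if my_rank < peer_rank else "slave"


def first_step(my_rank: int, peer_rank: int) -> str:
    # The slave always initiates; a master that wants a connection only asks
    # the slave to initiate one back to it.
    if role(my_rank, peer_rank) == "slave":
        return "send_connect_request"
    return "ask_slave_to_connect"


def handle_request(endpoints: List[Endpoint], initiator: QpId) -> dict:
    # Master side: any free endpoint may serve the request. The reply echoes
    # the initiator's lid:qp so the slave can find the right endpoint.
    for ep in endpoints:
        if ep.remote is None:
            ep.remote = initiator
            return {"echo": initiator, "responder": ep.local}
    raise RuntimeError("no free endpoint")


def handle_reply(endpoints: List[Endpoint], reply: dict) -> Endpoint:
    # Slave side: match on the echoed lid:qp instead of "first free endpoint".
    for ep in endpoints:
        if ep.local == reply["echo"]:
            ep.remote = reply["responder"]
            return ep
    raise LookupError("reply does not match any local endpoint")


# The slave has two endpoints and initiates from the second one.
slave_eps = [Endpoint((1, 10)), Endpoint((1, 11))]
master_eps = [Endpoint((2, 20)), Endpoint((2, 21))]
reply = handle_request(master_eps, slave_eps[1].local)
matched = handle_reply(slave_eps, reply)
```

With a "first free endpoint" lookup on the slave side, the reply would have
been bound to the first endpoint (1, 10) rather than to the endpoint that
actually sent the request, which is the failure described in the quoted mail.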

> 
> > 
> > Thanks,
> > 
> > Galen
> > 
> > 
> > On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
> > 
> > > Hello everyone,
> > >
> > > >   I encountered a problem with the openib on-demand connection code.
> > > > Basically, it works only by pure luck if you have more than one
> > > > endpoint for the same proc, and it sometimes breaks in mysterious ways.
> > >
> > > > The algo works like this: A wants to connect to B, so it creates a QP
> > > > and sends it to B. B receives the QP from A, looks for an endpoint
> > > > that is not yet associated with a remote endpoint, creates a QP for
> > > > it, and sends the info back. Now A receives the QP and goes through
> > > > the same logic as B, i.e. it looks for an endpoint that is not yet
> > > > connected, BUT there is no guarantee that it will find the endpoint
> > > > that initiated the connection in the first place! If it finds another
> > > > one, it will create a QP for it and send it back to B, and so on and
> > > > so forth. In the end I sometimes get a peculiar mesh of connections
> > > > where no QP has a connection back to it from the peer process.
> > >
> > > > To overcome this problem, B needs to send back some info that allows
> > > > A to determine which endpoint initiated the connection request. The
> > > > lid:qp pair allows for this. But even then the problem remains if two
> > > > procs initiate a connection at the same time. To deal with
> > > > simultaneous connections an asymmetric protocol has to be used: one
> > > > peer becomes the master and the other the slave. The slave always
> > > > initiates the connection to the master. The master chooses a local
> > > > endpoint to satisfy the incoming request and sends the info back to
> > > > the slave. If the master wants to initiate a connection, it sends a
> > > > message to the slave, and the slave initiates a connection back to
> > > > the master.
> > >
> > > > The included patch implements the algorithm described above and works
> > > > for all scenarios in which the current code fails to create a
> > > > connection.
> > >
> > > --
> > >                   Gleb.
> > > <fix_openib_wireup.diff>
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> 
> --
>                       Gleb.

--
                        Gleb.
