The patch applies to ib_multifrag as is, without conflicts. But the
branch doesn't compile with or without the patch, so I was not able to
test it. Do you have uncommitted changes that may generate a conflict?
Can you commit them so the conflicts can be resolved? If there is no
conflict between your work and this patch, maybe it is a good idea to
commit it to your branch and to trunk for testing?
I have a whole pile of changes that need to be committed, and even
with these changes it still doesn't compile, as I am reworking names,
data structures, etc.
I will commit what I have now, and will work on this a bit more over
the weekend.
- Galen
Thanks,
Galen
On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
Hello everyone,
I encountered a problem with the openib on-demand connection code.
Basically, it works only by pure luck if you have more than one endpoint
for the same proc, and it sometimes breaks in mysterious ways.
The algorithm works like this: A wants to connect to B, so it creates a
QP and sends it to B. B receives the QP from A, looks for an endpoint
that is not yet associated with a remote endpoint, creates a QP for it
and sends the info back. Now A receives the QP and goes through the same
logic as B, i.e. it looks for an endpoint that is not yet connected, BUT
there is no guarantee that it will find the endpoint that initiated the
connection in the first place! And if it finds another one, it will
create a QP for it and send it back to B, and so on and so forth. In the
end I sometimes get a peculiar mesh of connections where no QP has a
connection back to it from the peer process.
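To make the failure concrete, here is a small toy model of the matching logic described above. It is only a sketch, not the openib code: the Endpoint and Proc classes, the "first endpoint without a remote QP" scan order and the QP numbers starting at 100 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List, Optional


@dataclass
class Endpoint:
    name: str
    local_qp: Optional[int] = None   # QP created on this endpoint
    remote_qp: Optional[int] = None  # QP number received from the peer


class Proc:
    """Hypothetical process holding several endpoints to the same peer."""

    def __init__(self, names: List[str], qp_numbers: Iterator[int]):
        self.endpoints = [Endpoint(n) for n in names]
        self.qp_numbers = qp_numbers

    def initiate(self, index: int) -> int:
        """Create a QP on a chosen endpoint and return it for sending."""
        ep = self.endpoints[index]
        ep.local_qp = next(self.qp_numbers)
        return ep.local_qp

    def on_receive(self, remote_qp: int) -> Optional[int]:
        """Naive matching: take the first endpoint with no remote QP yet."""
        for ep in self.endpoints:
            if ep.remote_qp is None:
                ep.remote_qp = remote_qp
                if ep.local_qp is None:
                    ep.local_qp = next(self.qp_numbers)
                    return ep.local_qp  # reply with the newly created QP
                return None  # endpoint already had a QP, exchange stops
        return None


def run_naive():
    qp_numbers = count(100)  # assumed QP number allocator
    a = Proc(["a0", "a1"], qp_numbers)
    b = Proc(["b0", "b1"], qp_numbers)
    msg = a.initiate(1)  # A starts the connection from its second endpoint
    receiver, other = b, a
    while msg is not None:
        msg = receiver.on_receive(msg)
        receiver, other = other, receiver
    return a, b


def mutual_pairs(a: Proc, b: Proc):
    """Endpoint pairs whose QPs point at each other."""
    return [
        (x.name, y.name)
        for x in a.endpoints
        for y in b.endpoints
        if None not in (x.local_qp, y.local_qp)
        and x.remote_qp == y.local_qp
        and y.remote_qp == x.local_qp
    ]
```

With A initiating from a1, mutual_pairs(*run_naive()) comes back empty in this model: a1 points at b1, b1 at a0, a0 at b0 and b0 at a1, so no QP is connected back by its peer.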
To overcome this problem, B needs to send back some info that will allow
A to determine the endpoint that initiated the connection request. The
lid:qp pair will allow for this. But even then the problem will remain
if two procs initiate a connection at the same time. To deal with
simultaneous connections, an asymmetric protocol has to be used: one
peer becomes the master, the other the slave. The slave always initiates
a connection to the master. The master chooses a local endpoint to
satisfy the incoming request and sends the info back to the slave. If
the master wants to initiate a connection, it sends a message to the
slave and the slave initiates a connection back to the master.
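The asymmetric scheme can be sketched in the same toy style. This is not the attached patch: the class and function names, the message labels and the rule that the lower rank acts as master are assumptions for illustration only (how the master is actually chosen is not stated here).

```python
from dataclasses import dataclass
from itertools import count
from typing import Iterator, List, Optional, Tuple

Addr = Tuple[int, int]  # (lid, qp number), i.e. the lid:qp pair


@dataclass
class Endpoint:
    lid: int
    local: Optional[Addr] = None   # our own lid:qp once a QP exists
    remote: Optional[Addr] = None  # the peer's lid:qp once known


class Proc:
    """Hypothetical process; rank is only used to pick the master."""

    def __init__(self, rank: int, lids: List[int], qp_numbers: Iterator[int]):
        self.rank = rank
        self.endpoints = [Endpoint(lid) for lid in lids]
        self.qp_numbers = qp_numbers

    def _create_qp(self, ep: Endpoint) -> Addr:
        ep.local = (ep.lid, next(self.qp_numbers))
        return ep.local

    def _free_endpoint(self) -> Endpoint:
        return next(ep for ep in self.endpoints if ep.local is None)

    def slave_initiate(self) -> Addr:
        """Slave side: create a QP and send its lid:qp to the master."""
        return self._create_qp(self._free_endpoint())

    def master_on_request(self, slave_addr: Addr) -> Tuple[Addr, Addr]:
        """Master side: pick an endpoint, echo the slave's lid:qp back."""
        ep = self._free_endpoint()
        ep.remote = slave_addr
        return slave_addr, self._create_qp(ep)

    def slave_on_reply(self, echoed: Addr, master_addr: Addr) -> None:
        """Slave side: the echoed lid:qp identifies the initiating endpoint."""
        ep = next(e for e in self.endpoints if e.local == echoed)
        ep.remote = master_addr


def connect(initiator: Proc, peer: Proc) -> List[str]:
    # Assumption: the lower rank is the master.
    master, slave = sorted((initiator, peer), key=lambda p: p.rank)
    log = []
    if initiator is master:
        log.append("CONNECT_REQUEST master->slave")  # ask slave to start
    slave_addr = slave.slave_initiate()
    log.append("QP_INFO slave->master")
    echoed, master_addr = master.master_on_request(slave_addr)
    log.append("QP_REPLY master->slave")
    slave.slave_on_reply(echoed, master_addr)
    return log


def all_mutual(a: Proc, b: Proc) -> bool:
    """True if every connected endpoint of a is connected back from b."""
    by_addr = {ep.local: ep for ep in b.endpoints if ep.local is not None}
    return all(
        ep.remote in by_addr and by_addr[ep.remote].remote == ep.local
        for ep in a.endpoints
        if ep.local is not None
    )
```

In this model it does not matter which side asks for the connection: connect(a, b) and connect(b, a) both end with the slave sending the first QP, and the echoed lid:qp pair ties the reply to the endpoint that started the exchange.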
The included patch implements the algorithm described above and works
for all scenarios in which the current code fails to create a connection.
--
Gleb.
<fix_openib_wireup.diff>
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel