On Wed, 2007-05-09 at 11:42 -0400, Donald Kerr wrote:
> I agree OMPI trac ticket #890 should cover this. I will test the
> suggested fix, just removing that one line from btl_udapl.c, on Solaris.
> I am still not set up on Linux so hopefully Steve can confirm there.
>
All,

First, I haven't tested Arlin's dat_ep_query() fix yet, as we have determined it's not needed. The OMPI udapl btl never calls dat_ep_query()...

Running OMPI with the suggested fix (removing the overwriting of the hca_addr port field in btl_udapl.c) over ofed udapl on Chelsio's iwarp rnic still doesn't work. There are two new issues so far:

1) This has uncovered a connection migration issue in the Chelsio driver/firmware. We are developing and testing a fix for this now. Should be ready tomorrow, hopefully.

2) OMPI is not adhering to the iwarp protocol requirement that the ULP (in this case OMPI) initiating the iwarp connection, i.e. the side issuing the dat_ep_connect() or rdma_connect(), _MUST_ be the first to send an RDMA message. So if an OMPI process _accepts_ an rdma connection, it cannot send on that connection until it receives some sort of rdma operation from the client process. It appears the current OMPI connection setup model doesn't enforce this. This, combined with the bug above, causes an immediate connection failure on Chelsio's rnic. After I fix #1 above, things might get slightly better, but my guess is we will still have connection setup problems if the server side sends before the client side finishes the streaming->rdma mode transition.

There have been a series of discussions on the ofa general list about this issue, and the conclusion to date is that it cannot be resolved in the rdma-cm or iwarp-cm code of the linux rdma stack, mainly because sending an RDMA message involves the ULP's work queue and completion queue, so the CM cannot do this under the covers in a manner that doesn't affect the application. Thus, the applications must deal with this.

Here is a possible solution:

I assume that in OMPI, connections are only initiated when the mpi application does a send operation. Given that, the udapl btl must ensure that if a given rank accepts a connection, it cannot send anything until the rank at the other end of the connection sends first.
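To make the rule concrete, here is a rough sketch in C of the kind of send gating I have in mind. It is not actual btl_udapl code: every name in it (struct ep, ep_send(), ep_on_recv(), and so on) is hypothetical, and ep_post() is only a counter standing in for posting to the real send queue. It assumes the endpoint knows whether it initiated or accepted the connection, and that sends attempted too early can be held in a small pending list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names here are hypothetical; this is not btl_udapl code. */

enum ep_role { EP_INITIATOR, EP_ACCEPTOR };

#define MAX_PENDING 16

struct ep {
    enum ep_role role;
    bool peer_has_sent;               /* set once anything arrives from the peer */
    const void *pending[MAX_PENDING]; /* sends held back until the gate opens */
    size_t npending;
    size_t nposted;                   /* stand-in for sends given to the transport */
};

void ep_init(struct ep *e, enum ep_role role)
{
    e->role = role;
    e->peer_has_sent = false;
    e->npending = 0;
    e->nposted = 0;
}

/* The initiator may always send; the acceptor only after the peer has sent. */
bool ep_may_send(const struct ep *e)
{
    return e->role == EP_INITIATOR || e->peer_has_sent;
}

/* Stand-in for posting to the real send queue. */
void ep_post(struct ep *e, const void *msg)
{
    assert(ep_may_send(e));
    (void)msg;
    e->nposted++;
}

/* Returns 0 if posted, 1 if deferred, -1 if the pending list is full. */
int ep_send(struct ep *e, const void *msg)
{
    if (ep_may_send(e)) {
        ep_post(e, msg);
        return 0;
    }
    if (e->npending == MAX_PENDING)
        return -1;
    e->pending[e->npending++] = msg;
    return 1;
}

/* Call on every receive completion; the first one opens the gate
 * and flushes deferred sends in the order they were queued. */
void ep_on_recv(struct ep *e)
{
    size_t i;

    e->peer_has_sent = true;
    for (i = 0; i < e->npending; i++)
        ep_post(e, e->pending[i]);
    e->npending = 0;
}
```

With this shape, an accepting endpoint that calls ep_send() before the first ep_on_recv() gets 1 back and the message is queued; after ep_on_recv() the queued messages are posted in order and later sends go straight through.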
Since the other side initiated the connection, it will have pending data to send... I haven't looked into how painful this will be to implement.

Thoughts?

FYI: IETF Draft requiring this behavior:

http://www.ietf.org/internet-drafts/draft-ietf-rddp-mpa-08.txt

See section 7 for specifics.

Steve.