Sean Hefty wrote:
...
I didn't follow this.
...
Peer to peer SIDs are in a different domain than client/server SIDs, and
the peer_to_peer field is used to indicate which domain a SID is in.

Sorry if I wasn't clear. Let me see if I understand you: with this
different-domain implementation, under client/server the passive side
calls cm listen and the active side calls cm connect, whereas under
peer-to-peer both sides call cm listen and later either one side or
both sides may call cm connect. Correct?

To add to my comments on the CM API, struct ib_cm_req_param, which is
used to send the REQ, includes service_id and peer_to_peer fields.  The
latter is a boolean used by the CM to distinguish if incoming REQs can
be matched with the outgoing REQ.
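
As a rough illustration of the matching described above, here is a minimal
sketch. The struct and function names are hypothetical and simplified (this is
not the kernel's struct ib_cm_req_param), and the rule that two REQs pair up
only when both carry the peer_to_peer flag and the same service ID is an
assumption made for illustration, not something taken from the CM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a REQ: only the two fields discussed. */
struct req_sketch {
    uint64_t service_id;   /* SID the REQ is addressed to */
    bool     peer_to_peer; /* true if the SID lives in the peer-to-peer domain */
};

/*
 * Assumed rule: an incoming REQ may be matched with our outstanding outgoing
 * REQ only if both are peer-to-peer and both name the same SID. A
 * client/server REQ never matches, because its SID is in a different domain.
 */
static bool req_can_match(const struct req_sketch *outgoing,
                          const struct req_sketch *incoming)
{
    assert(outgoing != NULL && incoming != NULL);

    if (!outgoing->peer_to_peer || !incoming->peer_to_peer)
        return false;

    return outgoing->service_id == incoming->service_id;
}
```

With this rule, a client/server REQ that happens to carry the same numeric SID
as an outstanding peer-to-peer REQ is still treated as unrelated.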

OK, this makes things clearer.

Why should there be a difference between the rdma-cm and the cm? If the
cm has a model that needs no API change, wouldn't it also apply to
the rdma-cm?

The rdma_cm does not know how to set the peer_to_peer field in the
ib_cm_req_param.  It sets this field to 0 today.

But it could set it to 1 as well... Assuming my understanding above of
the suggested implementation is correct, we can change the RDMA-CM API
to let users specify on rdma_connect that they want peer-to-peer
support. Such apps can then issue an rdma_listen call, later call
rdma_connect with this bit set, and they are done (or almost done... I
guess the devil is in the details here, isn't it?)

 > I think that in the MPI world each rank gets a SID from the local CM and
 > they exchange the SIDs out-of-band, then connections are opened. If it's
 > a connection-on-demand scheme, then whenever the rank process calls
 > mpi_send() to a peer for which the local MPI library does not have a
 > connection, it tries to connect. So if this happens "at once" between
 > some pair of ranks, there should be a way to form one connection out of
 > these two connection requests. My thinking/motivation is that support for
 > this scheme should be at the IB stack (cm and rdma-cm) level and not at
 > the specific MPI implementation level.

Are the out of band connections used by MPI formed using client/server
or peer to peer?  I believe that Intel MPI has each rank listen for
connections from the ranks below it using client/server.

Yes, MPIs that do all-to-all connect on job start typically use
client/server, where all the ranks > 0 issue a listen call and then all
lower ranks connect to higher ranks, or use some other symmetry-breaking
scheme. I am trying to see what needs to be supported by the IB stack to
let MPIs that do connect on demand use the RDMA-CM.
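
To make the rank-based symmetry breaking concrete, here is a minimal sketch of
the "lower ranks connect to higher ranks" rule. The enum and function names are
hypothetical; a real MPI library would then call its listen or connect path
based on the returned role.

```c
#include <assert.h>

/* Hypothetical roles for one end of a connection between two ranks. */
enum conn_role {
    ROLE_ACTIVE,  /* this rank issues the connect */
    ROLE_PASSIVE  /* this rank listens and accepts */
};

/*
 * Rank-based symmetry breaking: for any pair of distinct ranks, the lower
 * rank is active and the higher rank is passive, so the two ends always
 * pick opposite roles without any extra negotiation.
 */
static enum conn_role choose_role(int my_rank, int peer_rank)
{
    assert(my_rank != peer_rank); /* a rank does not connect to itself */

    return my_rank < peer_rank ? ROLE_ACTIVE : ROLE_PASSIVE;
}
```

Under this rule rank 0 never needs to listen, which is consistent with only
the ranks > 0 issuing a listen call.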

There are a couple of problems with the peer to peer model.  First,
unless the connections occur at exactly the same time, they miss
connecting (rejected with invalid SID).

This makes the all-peer-to-peer model useless, since an app cannot make
sure that connections occur at exactly the same time! My understanding of
the spec is that the peer-to-peer model can handle connections that occur
at exactly the same time, but not only those.

Second, if multiple peer to
peer connections need to form between the same pair of nodes, things can
go screwy (that's the technical term) trying to match up the peer requests.

Under MPI each rank uses a different SID, so I think we are safe from this problem.

Or





