Greetings,

We have some questions about the best way to make use of Open MPI's features for a new BTL. The general theme is using opal events versus a btl_progress function: when is it best to use one rather than the other?
We are working on several designs for an SCTP BTL for Open MPI. The familiar one uses "TCP-style" one-to-one sockets, with one socket per endpoint pair, just as the TCP BTL does now. A less familiar one uses a single "UDP-style" one-to-many socket per BTL. To illustrate, suppose you have 3 processes: each process has only one socket, on which connections are established and messages are sent to and received from the other two processes. It is this design that we currently have questions about.

So far, we have not implemented our own btl_progress function. This means that within opal_progress(), poll() is called based on the opal events registered within the BTL. As in the TCP BTL, when an MPI_Send happens, the endpoint_send_event is added, which adds POLLOUT for this socket on behalf of the given endpoint. Since MPI_Send is blocking, it doesn't really matter that this socket is shared with other btl_endpoints, because the sending endpoint is the only one with an opal send event added.

However, this is not the case with non-blocking sends. When we have multiple outstanding non-blocking requests to different endpoints, we have to queue them, since the endpoints share the same one-to-many socket and events are associated with a single btl_endpoint. Say proc C runs this pseudocode:

    iSend(proc A)
    iSend(proc B)
    Waitall()

Within Waitall, our current design using opal events does eventually complete the iSend to proc A, but the iSend to proc B can't start until proc A's is done. We currently queue the endpoints waiting for the poll() POLLOUT event, and we dequeue from this queue when the event for proc A's endpoint is deleted (at which point proc B's endpoint is added to the POLLOUT event).

Can you think of a way, using the existing framework, to eliminate the restriction that the send to proc A has to complete before the send to proc B can start?
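To make the current behaviour concrete, here is a minimal, self-contained sketch of the queueing described above. It is not the actual BTL code: struct endpoint, struct shared_socket, MAX_WAITERS and all three functions are hypothetical stand-ins, and a "writable event" simply models one POLLOUT callback in which the socket accepts up to `room` bytes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8   /* hypothetical bound, for this sketch only */

/* Hypothetical stand-in for a BTL endpoint (not the real Open MPI type). */
struct endpoint {
    const char *peer;      /* name of the destination process */
    size_t bytes_left;     /* bytes of the pending send not yet written */
};

/* One shared one-to-many socket: only the head of the FIFO owns POLLOUT. */
struct shared_socket {
    struct endpoint *waiting[MAX_WAITERS];
    size_t head, count;
};

/* Queue an endpoint that has data to send on the shared socket. */
static int sock_enqueue(struct shared_socket *s, struct endpoint *ep)
{
    if (s->count == MAX_WAITERS)
        return -1;
    s->waiting[(s->head + s->count) % MAX_WAITERS] = ep;
    s->count++;
    return 0;
}

/* Models one POLLOUT callback in which the socket accepts up to `room`
 * bytes. Only the head endpoint may write; once its send completes it is
 * dequeued, which hands POLLOUT to the next endpoint in line. Returns the
 * endpoint that was serviced, or NULL if nothing is queued. */
static struct endpoint *sock_on_writable(struct shared_socket *s, size_t room)
{
    if (s->count == 0)
        return NULL;
    struct endpoint *ep = s->waiting[s->head];
    size_t n = ep->bytes_left < room ? ep->bytes_left : room;
    ep->bytes_left -= n;
    if (ep->bytes_left == 0) {
        s->head = (s->head + 1) % MAX_WAITERS;
        s->count--;
    }
    return ep;
}

/* Runs the iSend(A); iSend(B); Waitall() scenario and returns how many
 * writable events elapse before the send to B is serviced at all. */
static int events_before_b_starts(size_t len_a, size_t len_b, size_t room)
{
    struct shared_socket s = { {0}, 0, 0 };
    struct endpoint a = { "A", len_a };
    struct endpoint b = { "B", len_b };
    int events = 0;

    assert(room > 0);                     /* otherwise nothing progresses */
    assert(sock_enqueue(&s, &a) == 0);    /* iSend(proc A) */
    assert(sock_enqueue(&s, &b) == 0);    /* iSend(proc B) */

    for (;;) {                            /* Waitall() */
        struct endpoint *ep = sock_on_writable(&s, room);
        if (ep == NULL || ep == &b)
            break;
        events++;
    }
    return events;
}
```

In this model, the number of events that pass before B is serviced grows with the size of A's message, which is exactly the serialization we would like to avoid.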
We were trying to use the existing framework, but in our case it may make more sense to implement our own btl_progress function, since poll() doesn't really make sense for a single socket anyway. Do you think that would be best?

We noticed that mca_bml_r2_progress calls btl_progress[i](), which is set in mca_bml_r2_add_procs if NULL != btl->btl_component->btl_progress. Is there an example of a BTL that implements its own btl_progress function? I just want to make sure this is even a possibility before traveling down this path, and maybe learn from others who have done it before.

Thanks ahead of time for any help!

brad