So I tracked this issue down, and it seems that the new behavior was introduced one year ago by commit 12433. Starting from this commit, there was no pipelining in the RDMA protocol. That makes sense, as we don't usually use NetPipe to check the performance of the message logging (we use real applications). However, last week we ran NetPipe, and that's how we realized that pipelining was missing in the RDMA case.

I would be in favor of having a consistent behavior everywhere. In other words, don't ask the user to know whether or not an mpool is associated with a particular device in order to figure out which protocol we use internally. Actually, it's not only for users; it might help us as well.

  Thanks,
    george.

On Feb 20, 2008, at 4:29 AM, Gleb Natapov wrote:

On Tue, Feb 19, 2008 at 10:40:46PM -0500, George Bosilca wrote:
Actually, it restores the original behavior. The RDMA operations were
pipelined before the r15247 commit, regardless of whether the BTLs had an
mpool or not. We were actively using this behavior in the message
logging framework to hide the cost of the local storage of the payload,
and we were quite surprised when we realized that it had disappeared.
I checked v1.2 with the tcp BTL (I can't test mx or elan, but tcp also
supports RDMA and has no mpool), and no matter what btl_tcp_max_rdma_size
I provide, the whole buffer is sent in one RDMA operation. Here is the
explanation of why this happens:
1. If a BTL is RDMA capable but does not provide an mpool,
mca_pml_ob1_rdma_btls() assumes that memory is always registered. In our
case this function will therefore always return a non-zero value for any
buffer it is called with.

2. When mca_pml_ob1_send_request_start_btl() chooses which function to
use for the rendezvous send, it checks whether the buffer is contiguous,
and if it is, it checks whether the buffer is already registered by looking
at the non-zero value returned by mca_pml_ob1_rdma_btls(). For BTLs without
an mpool, mca_pml_ob1_send_request_start_rdma() is therefore always chosen.
(A condensed sketch of this whole decision path appears after step 4 below.)

3. The receiver checks whether its local buffer is registered by calling
mca_pml_ob1_rdma_btls() on it (see pml_ob1_recvreq.c:259):

 recvreq->req_rdma_cnt = mca_pml_ob1_rdma_btls(
         bml_endpoint,
         (unsigned char*) base,
         recvreq->req_recv.req_bytes_packed,
         recvreq->req_rdma);
So recvreq->req_rdma_cnt is set to a non-zero value (if the receive buffer
is contiguous, of course).

4. The receiver sends PUT messages to the sender in
mca_pml_ob1_recv_request_schedule_exclusive(). Here is a code snippet
from that function (see pml_ob1_recvreq.c:684):

      /* makes sure that we don't exceed BTL max rdma size
       * if memory is not pinned already */
      if(0 == recvreq->req_rdma_cnt &&
            bml_btl->btl_max_rdma_size != 0 &&
            size > bml_btl->btl_max_rdma_size)
      {

          size = bml_btl->btl_max_rdma_size;
      }
Pay special attention to the comment. If recvreq->req_rdma_cnt is not
zero, btl_max_rdma_size is ignored and the message is sent as one big RDMA
operation.
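
To make the four steps above easier to follow, here is a condensed,
self-contained sketch of that decision path. Only mca_pml_ob1_rdma_btls(),
btl_max_rdma_size and req_rdma_cnt come from the quoted code; every other
name below is invented for the sketch and does not exist in the ob1 sources.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct sketch_btl {
    bool   has_mpool;           /* does the BTL provide an mpool?          */
    size_t btl_max_rdma_size;   /* RDMA pipeline fragment limit (0 = none) */
};

/* Step 1: without an mpool the PML treats the buffer as already
 * registered, so the count is non-zero for any contiguous buffer. */
static size_t sketch_rdma_btls(const struct sketch_btl *btl,
                               const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return btl->has_mpool ? 0   /* only truly registered regions counted */
                          : 1;  /* pretend the region is registered      */
}

/* Steps 2-4: a non-zero count selects the "already pinned" RDMA path and
 * skips the btl_max_rdma_size clamp in the scheduling loop, so the whole
 * message goes out in a single RDMA operation. */
static size_t sketch_first_fragment(const struct sketch_btl *btl,
                                    size_t rdma_cnt, size_t msg_size)
{
    if (0 == rdma_cnt &&
        btl->btl_max_rdma_size != 0 &&
        msg_size > btl->btl_max_rdma_size) {
        return btl->btl_max_rdma_size;  /* pipelined fragment      */
    }
    return msg_size;                    /* one big RDMA operation  */
}

int main(void)
{
    struct sketch_btl tcp_like = { .has_mpool = false,
                                   .btl_max_rdma_size = 1u << 20 };
    size_t msg = (size_t)8 << 20;       /* 8 MB contiguous buffer  */
    size_t cnt = sketch_rdma_btls(&tcp_like, NULL, msg);

    printf("rdma_cnt=%zu, first fragment=%zu bytes\n",
           cnt, sketch_first_fragment(&tcp_like, cnt, msg));
    return 0;
}

With a tcp-like BTL (no mpool) the count is non-zero, so the first fragment
equals the full 8 MB regardless of btl_max_rdma_size, which is exactly the
behaviour described above.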

So what I have shown here is that there was no pipelining for the TCP BTL
in v1.2, and that the code was specifically written to behave this way.
If you still think that there is a difference in behaviour between v1.2
and the trunk, can you explain which code path is executed in v1.2 for
your test case and how the trunk behaves differently?


If a BTL doesn't want to use pipelining for RDMA operations, it can set the RDMA fragment size to the maximum value, and this will automatically disable the pipeline. However, if a BTL does support pipelining, with today's trunk it is not possible to activate it. Moreover, in the current version the parameters that define the BTL behavior are blatantly ignored, as the
PML makes high-level assumptions about what the BTLs want to do.
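
For illustration only, assuming the v1.2-era field name btl_max_rdma_size
quoted above, that convention could look roughly like the following; the
structure and function names here are invented for the sketch and are not
the real BTL interface.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical, heavily reduced module structure: the real
 * mca_btl_base_module_t has many more fields. Only the name
 * btl_max_rdma_size is taken from the code quoted above. */
struct sketch_btl_module {
    size_t btl_max_rdma_size;
};

/* A BTL that does not want RDMA pipelining simply advertises an
 * "unlimited" fragment size, so the clamp shown in step 4 never splits a
 * transfer.  A BTL that does want pipelining advertises a finite fragment
 * size and expects the PML to honor it unconditionally. */
static void sketch_btl_module_init(struct sketch_btl_module *module,
                                   int want_pipeline, size_t frag_size)
{
    module->btl_max_rdma_size = want_pipeline ? frag_size : SIZE_MAX;
}

Under that convention the PML would only need to respect btl_max_rdma_size
unconditionally, instead of inferring the protocol from the presence of an
mpool.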
I am not defending the current behaviour. If you want to change it, we can
discuss the exact semantics that you want to see. But before that I want to
make sure that the trunk is indeed different from v1.2 in this regard, as
you claim it to be. Can you provide me with a test case that works
differently in v1.2 and the trunk?
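
For readers who want to check this themselves, a NetPipe-style test case
reduces to a single large contiguous rendezvous transfer over the tcp BTL;
whether it appears on the wire as one RDMA operation or as
btl_tcp_max_rdma_size-sized fragments is exactly the difference in
question. A rough MPI sketch (illustrative only, not the test George used):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Send one large contiguous buffer between two ranks, large enough to
 * force the rendezvous/RDMA path rather than the eager path. */
int main(int argc, char **argv)
{
    const int count = 8 * 1024 * 1024;   /* 8 MB */
    int rank = 0;
    char *buf = calloc(count, 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        MPI_Send(buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        MPI_Recv(buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received %d bytes\n", count);
    }

    MPI_Finalize();
    free(buf);
    return 0;
}

Run it with two processes over the tcp BTL and an explicit fragment size,
for example: mpirun -np 2 --mca btl tcp,self --mca btl_tcp_max_rdma_size
1048576 ./rdma_test, on both v1.2 and the trunk, and compare the
fragmentation seen in a packet trace.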

--
                        Gleb.
