Re: [OMPI devel] Sending large messages over RDMA fails

Doron Shoham Sun, 5 Dec 2010 11:28:11 -0500

Jeff Squyres wrote:

On Nov 29, 2010, at 3:51 AM, Doron Shoham wrote:

If only the PUT flag is set and/or the btl supports only the PUT method, then 
the sender will allocate a rendezvous header and will not eagerly send any data. 
The receiver will schedule RDMA PUT(s) of the entire message.
This is done in mca_pml_ob1_recv_request_schedule_once()
(ompi/mca/pml/ob1/pml_ob1_recvreq.c:683).
We can limit the size passed to mca_bml_base_prepare_dst() to the minimum of 
the btl.max_message_size supported by the HCA and the actual message size.
This will result in fragmentation of the RDMA write messages.
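
As a rough illustration of the clamping idea only (not the actual ob1 
scheduling code): the helper names next_frag_size() and count_put_frags() and 
the max_msg_sz parameter below are hypothetical, made up for this sketch. Each 
PUT is limited to the smaller of the remaining bytes and the limit, so a large 
message ends up split into several RDMA writes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size of the next RDMA write fragment, i.e. the
 * minimum of what is left to send and the maximum message size. */
static size_t next_frag_size(size_t bytes_remaining, size_t max_msg_sz)
{
    assert(max_msg_sz > 0);  /* a zero limit would never make progress */
    return bytes_remaining < max_msg_sz ? bytes_remaining : max_msg_sz;
}

/* Hypothetical helper: walk the message the way a scheduler would and
 * count how many clamped PUT fragments it takes to cover it. */
static size_t count_put_frags(size_t msg_size, size_t max_msg_sz)
{
    size_t offset = 0;
    size_t nfrags = 0;

    while (offset < msg_size) {
        offset += next_frag_size(msg_size - offset, max_msg_sz);
        nfrags++;
    }
    return nfrags;
}
```

With a limit of 4 bytes, a 10-byte message would be scheduled as fragments of 
4, 4 and 2 bytes.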


I would think that we should set btl.max_message_size during init to be the 
minimum of the MCA param and the max supported by the HCA, right?  Then there's 
no need for this min() in the critical path.

Additionally, the message must be smaller than the max message size of *both* 
HCAs, right?  So it might be necessary to add the max message size into the 
openib BTL modex data so that you can use it in mca_bml_base_prepare_dst() (or 
whatever -- been a long time since I've mucked around in there...) to compute 
the min between the two peers.

So you might still need a min, but for a different reason than what you 
originally mentioned.

It is my mistake - btl.max_message_size is a different parameter. It is more of a software limitation than a hardware limitation from the HCA.

I don't think it has any meaning in the RDMA flow.

Can you please explain a bit more about the openib BTL modex?

The bigger problem is when using the GET flow.
In this flow the receiver allocates one big buffer to receive the message with 
RDMA read in one chunk.
There is no fragmentation mechanism in this flow, which makes it harder to 
solve this issue.


Doh.  I'm afraid I don't know why this was done this way originally...

Reading the max message size supported by the HCA can be done by using verbs.
The second approach is to use RDMA direct only if the message size is smaller than the max message size supported by the HCA.
Here is where the long message protocol is chosen:
ompi/mca/pml/ob1/pml_ob1_sendreq.h line 382.
We could use the second approach until a fragmentation mechanism is added to the RDMA direct GET flow.


Are you suggesting that pml_ob1_sendreq.h:382 compare the message length to the 
btl.max_message_size and choose RDMA direct vs. RDMA pipelined?  If so, that 
might be sufficient.

But what to do about the peer's max message size?


I thought of a different approach:

Instead of limiting the size passed to mca_bml_base_prepare_dst(), we can limit the size in mca_btl_openib_prepare_dst(). I believe this is a better solution because it only affects the internal behavior of the openib btl. In mca_btl_openib_prepare_dst() we have access to both max_msg_sz values (local and endpoint).

This will fix the PUT flow.

For the GET flow, maybe we should check in mca_pml_ob1_send_request_start_rdma() - if the message size is larger than the minimum of both endpoints' max_msg_sz, force it to use the PUT flow.
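
Roughly, the check could look like the sketch below. This is only an 
illustration of the decision: choose_rdma_proto(), the enum, and the 
local_max/remote_max parameters are hypothetical names, not existing ob1 or 
openib symbols, and it assumes both limits are already known at this point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical protocol tags, for illustration only. */
enum rdma_proto { PROTO_RDMA_GET, PROTO_RDMA_PUT };

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/* Hypothetical decision: a single RDMA read must fit within the limits
 * of both endpoints; otherwise fall back to the PUT flow, where the
 * writes can be fragmented. */
static enum rdma_proto choose_rdma_proto(size_t msg_size,
                                         uint64_t local_max,
                                         uint64_t remote_max)
{
    uint64_t limit = min_u64(local_max, remote_max);

    assert(limit > 0);
    return (uint64_t)msg_size > limit ? PROTO_RDMA_PUT : PROTO_RDMA_GET;
}
```

A message exactly equal to the smaller limit still goes through GET; anything 
larger is forced to PUT.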


The problem is that I'm not sure how to do it without an *ugly hack*.

We need to access the openib btl parameters from mca_pml_ob1_send_request_start_rdma().

The second option is to do it from pml_ob1_sendreq.h:382, but then again, it requires access to the openib btl parameters...


Any thoughts?

Thanks,
Doron
