Hi list,

We did some testing on the memory used by InfiniBand queues in Open MPI with the XRC protocol, which is supposed to reduce the memory needed for InfiniBand connections.

When using XRC queues, Open MPI is indeed creating only one XRC queue per remote node (instead of one per remote process). The problem is that the number of send elements in this queue is multiplied by the number of processes on the remote host.
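To make the arithmetic concrete, here is a small standalone sketch (the numbers are placeholders, not values from our runs): with sd_max send elements per queue and ppn processes on the remote host, the single XRC queue ends up with sd_max * ppn send elements, i.e. the same send-side total as ppn per-process queues of sd_max each.

#include <stdio.h>

/* Illustrative arithmetic only -- sd_max and ppn are example values,
 * not numbers taken from the Open MPI source or from our measurements. */
int main(void)
{
    int sd_max = 32;  /* send elements configured per queue */
    int ppn    = 8;   /* processes per remote host          */

    /* per-process queues: one queue of sd_max elements per remote process */
    printf("per-process layout: %d send elements per remote host\n", ppn * sd_max);

    /* XRC: one queue per remote host, but its depth is sd_max * ppn */
    printf("XRC layout:         %d send elements per remote host\n", sd_max * ppn);
    return 0;
}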

So, what do we gain from this? Not much, except that we can reduce the sd_max parameter to 1 element and still get 8 elements in the send queue (on 8-core machines), which may still be acceptable performance-wise.
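For reference, sd_max is (if we read the queue-spec parsing correctly) the last field of an "S"/"X" entry in the btl_openib_receive_queues MCA parameter, so something along the lines of --mca btl_openib_receive_queues "X,128,256,192,1:X,2048,256,128,1:X,12288,256,128,1:X,65536,256,128,1" should give the 1-element-per-process behaviour described above; the buffer sizes and counts here are only placeholders, not a recommendation.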

Send queues are created lazily, so spending a lot of memory on send queues is not necessarily blocking. What is blocking is the receive queues, because they are created during MPI_Init, so in a way they are the "base fare" of MPI.
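To illustrate why the receive side is the fixed cost, here is a rough standalone estimate: for a shared ("S" or "X") queue, rd_num receive buffers of the configured size are allocated and posted up-front, per queue and per HCA, whether or not a peer ever sends anything. The queue specs below are placeholders, not the ones we actually ran with.

#include <stdio.h>

/* Rough, illustrative estimate of the receive-side memory posted at
 * MPI_Init for shared ("S"/"X") queues.  The buffer sizes and rd_num
 * values below are placeholders, not our actual configuration. */
struct qp_spec { size_t buf_size; int rd_num; };

int main(void)
{
    struct qp_spec qps[] = {
        {   128, 256 },
        {  2048, 256 },
        { 12288, 256 },
        { 65536, 256 },
    };
    size_t total = 0;

    for (size_t i = 0; i < sizeof(qps) / sizeof(qps[0]); i++)
        total += qps[i].buf_size * (size_t)qps[i].rd_num;

    printf("receive buffers posted at MPI_Init: ~%zu KB per process, per HCA\n",
           total / 1024);
    return 0;
}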

The XRC protocol seems to create shared receive queues, which is a good thing. However, comparing the memory used by an "X" queue versus an "S" queue, we see a large difference. Digging a bit into the code, we found some strange things, like the completion queue size not being computed the same way as for "S" queues (the patch below would fix that, but the root of the problem may lie elsewhere).

Is anyone able to comment on this?

Thanks,
Sylvain


diff -r eeaa1548ddaf ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c  Fri May 14 01:08:00 2010 +0000
+++ b/ompi/mca/btl/openib/btl_openib.c  Mon May 17 14:34:46 2010 +0200
@@ -379,7 +379,7 @@

     /* figure out reasonable sizes for completion queues */
     for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++) {
-        if(BTL_OPENIB_QP_TYPE_SRQ(qp)) {
+        if(BTL_OPENIB_QP_TYPE_SRQ(qp) || BTL_OPENIB_QP_TYPE_XRC(qp)) {
             send_cqes = mca_btl_openib_component.qp_infos[qp].u.srq_qp.sd_max;
             recv_cqes = mca_btl_openib_component.qp_infos[qp].rd_num;
         } else {
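
For context, the hunk above is in the loop that sizes the completion queues. Without the XRC case, an "X" queue falls through to the else branch, which (as far as we can tell) sizes the CQ the way the per-peer queues are sized, i.e. proportionally to the number of peers, instead of using sd_max/rd_num once per shared queue. A condensed sketch of the two sizing rules, with hypothetical names standing in for the real fields and the per-peer formula being our reading rather than a quote of the code:

#include <stdio.h>

/* Condensed sketch of the two CQ sizing rules around the hunk above.
 * sd_max, rd_num and num_peers stand in for the real fields; the
 * per-peer branch is our reading of the surrounding code, not a quote. */
static int cq_entries(int is_shared, int sd_max, int rd_num, int num_peers)
{
    if (is_shared)                      /* SRQ or, with the patch, XRC     */
        return sd_max + rd_num;         /* sized once per shared queue     */
    return 2 * rd_num * num_peers;      /* per-peer: grows with job size   */
}

int main(void)
{
    /* Example values only: 32 send / 256 receive elements, 64 peers. */
    printf("shared sizing  : %d CQ entries\n", cq_entries(1, 32, 256, 64));
    printf("per-peer sizing: %d CQ entries\n", cq_entries(0, 32, 256, 64));
    return 0;
}

Before the patch, "X" queues take the second path, so their CQs grow with the number of peers, which could explain part of the "X" versus "S" difference mentioned above.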

