Hi list,
We did some testing on the memory used by InfiniBand queues in Open MPI with
the XRC protocol, which is supposed to reduce the memory needed for
InfiniBand connections.
When using XRC queues, Open MPI indeed creates only one XRC queue per remote
node (instead of one per remote process). The problem is that the number of
send elements in this queue is multiplied by the number of processes on the
remote host.
So, what do we gain from this? Not much, except that we can reduce the
sd_max parameter to 1 element and still end up with 8 elements in the send
queue (on 8-core machines), which may still be acceptable on the performance
side.
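To make the scaling concrete, here is a trivial sketch of the arithmetic
described above (hypothetical values, plain illustration, not Open MPI code):

/* Illustration only: even with sd_max reduced to 1, the single per-node
 * XRC send queue still ends up with one send element per remote process. */
#include <stdio.h>

int main(void)
{
    int sd_max = 1;          /* send depth requested per process */
    int procs_per_node = 8;  /* e.g. an 8-core remote host */

    /* observed effective size of the shared XRC send queue */
    int xrc_send_elems = sd_max * procs_per_node;

    printf("XRC send queue elements: %d\n", xrc_send_elems); /* prints 8 */
    return 0;
}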
Send queues are created lazily, so having a lot of memory for send queues is
not necessarily a blocker. What is blocking is the receive queues, because
they are created during MPI_Init, so in a way, they are the "basic fare" of
MPI.
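As a rough back-of-the-envelope illustration of why the receive side is what
matters at MPI_Init (buffer sizes and counts below are hypothetical, not the
Open MPI defaults): per-peer receive queues are replicated for every remote
process, while a shared (SRQ/XRC) receive queue is allocated once per queue.

/* Rough sketch, hypothetical values: per-peer vs shared receive queue
 * memory posted at MPI_Init. */
#include <stdio.h>

int main(void)
{
    size_t buffer_size  = 4096;  /* size of one posted receive buffer */
    size_t rd_num       = 256;   /* receive descriptors per queue */
    size_t remote_procs = 1024;  /* peers seen by one process */

    /* per-peer receive queues: one full queue per remote process */
    size_t per_peer = remote_procs * rd_num * buffer_size;

    /* shared receive queue: one queue, whatever the job size */
    size_t shared = rd_num * buffer_size;

    printf("per-peer: %zu MB\n", per_peer / (1024 * 1024)); /* 1024 MB */
    printf("shared  : %zu MB\n", shared / (1024 * 1024));   /*    1 MB */
    return 0;
}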
The XRC protocol seems to create shared receive queues, which is a good
thing. However, comparing the memory used by an "X" queue versus an "S"
queue, we can see a large difference. Digging a bit into the code, we found
some strange things, like the completion queue size not being the same as for
"S" queues (the patch below would fix it, but the root of the problem may be
elsewhere).
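To make explicit what the patch below changes, here is the sizing rule it
introduces as a standalone sketch (the struct is a simplified placeholder,
not the real qp_infos layout, and the per-peer branch is deliberately not
modelled): with the change, "X" queues size their completion queues from
sd_max and rd_num exactly like "S" queues.

/* Standalone sketch of the patched sizing rule; simplified, not the real
 * openib BTL data structures. */
#include <stdint.h>
#include <stdio.h>

struct qp_info_sketch {
    char     type;    /* 'P', 'S' or 'X' */
    uint32_t rd_num;  /* receive descriptors */
    uint32_t sd_max;  /* max outstanding sends (S/X queues) */
};

static void cq_sizes(const struct qp_info_sketch *qp,
                     uint32_t *send_cqes, uint32_t *recv_cqes)
{
    if (qp->type == 'S' || qp->type == 'X') {  /* patched condition */
        *send_cqes = qp->sd_max;
        *recv_cqes = qp->rd_num;
    } else {
        /* 'P' (per-peer) queues follow a different rule in the real code;
         * not modelled here. */
        *send_cqes = *recv_cqes = 0;
    }
}

int main(void)
{
    struct qp_info_sketch x = { 'X', 256, 32 };
    uint32_t s, r;
    cq_sizes(&x, &s, &r);
    printf("X queue: send_cqes=%u recv_cqes=%u\n", s, r);
    return 0;
}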
Is anyone able to comment on this?
Thanks,
Sylvain
diff -r eeaa1548ddaf ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c Fri May 14 01:08:00 2010 +0000
+++ b/ompi/mca/btl/openib/btl_openib.c Mon May 17 14:34:46 2010 +0200
@@ -379,7 +379,7 @@
/* figure out reasonable sizes for completion queues */
for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++) {
- if(BTL_OPENIB_QP_TYPE_SRQ(qp)) {
+ if(BTL_OPENIB_QP_TYPE_SRQ(qp) || BTL_OPENIB_QP_TYPE_XRC(qp)) {
send_cqes = mca_btl_openib_component.qp_infos[qp].u.srq_qp.sd_max;
recv_cqes = mca_btl_openib_component.qp_infos[qp].rd_num;
} else {