Please see below.


When using XRC queues, Open MPI is indeed creating only one XRC send queue per remote host (instead of one per remote process). The problem is that the number of send elements in this queue is multiplied by the number of processes on the remote host.

So, what do we gain from this? Not much, except that we can reduce the sd_max parameter to 1 element and still have 8 elements in the send queue (on 8-core machines), which may still be OK on the performance side.
Don't forget that the QP object itself consumes some memory at the driver/verbs level. But I agree that we need to provide more flexibility: it would be nice for the default multiplication coefficient to be smaller, and I also think we should make it a user-tunable parameter (yep, one more parameter).
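For illustration only, here is a rough sketch of the sizing arithmetic we are talking about. The function and parameter names below are made up (they are not existing Open MPI MCA parameters); the point is just that a smaller, tunable coefficient would cap the per-remote-process multiplication.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical illustration of the XRC send-queue sizing being discussed:
 * one send QP per remote host, with its depth scaled by the number of
 * processes on that host.  Names are invented; this is not Open MPI code. */
static uint32_t xrc_send_queue_depth(uint32_t sd_max,          /* per-process send depth */
                                     uint32_t procs_on_remote, /* ranks on the remote host */
                                     uint32_t coeff)           /* proposed tunable coefficient */
{
    /* Today the effective depth is sd_max * procs_on_remote; a smaller,
     * user-tunable coefficient would cap the multiplication. */
    uint32_t scale = procs_on_remote < coeff ? procs_on_remote : coeff;
    uint32_t depth = sd_max * scale;
    return depth ? depth : 1;   /* never size a QP to zero send elements */
}

int main(void)
{
    /* sd_max = 1 on an 8-core remote host still yields 8 send elements. */
    printf("depth = %u\n", xrc_send_queue_depth(1, 8, 8));
    return 0;
}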

Send queues are created lazily, so using a lot of memory for send queues is not necessarily a blocker. What is blocking is the receive queues: they are created during MPI_Init, so in a way they are the "basic fare" of MPI.
BTW, SRQ resources are also allocated on demand. We start with a very small SRQ, and it is grown when the SRQ limit event fires.
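For reference, a minimal standalone sketch of that verbs mechanism (not Open MPI's actual code): start with a small number of posted receives, then post another batch whenever the SRQ limit async event fires. Device, PD, MR and buffer-pool setup, error handling and the batch/watermark numbers are all assumed here.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Re-arm the SRQ limit: the provider raises IBV_EVENT_SRQ_LIMIT_REACHED once
 * the number of posted receives drops below 'limit'. */
static int arm_srq_limit(struct ibv_srq *srq, uint32_t limit)
{
    struct ibv_srq_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.srq_limit = limit;
    return ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT);
}

/* Post 'count' more receive buffers to the SRQ (buffers come from some pool). */
static int post_more_recvs(struct ibv_srq *srq, struct ibv_mr *mr,
                           char *pool, size_t buf_size, int count)
{
    for (int i = 0; i < count; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)(pool + i * buf_size),
            .length = (uint32_t)buf_size,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = { .wr_id = (uint64_t)i, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad;
        int rc = ibv_post_srq_recv(srq, &wr, &bad);
        if (rc)
            return rc;
    }
    return 0;
}

/* Async-event loop: when the SRQ limit is reached, post another batch and
 * re-arm, so SRQ receive buffers are only posted on demand. */
static void srq_event_loop(struct ibv_context *ctx, struct ibv_srq *srq,
                           struct ibv_mr *mr, char *pool, size_t buf_size)
{
    struct ibv_async_event ev;
    while (ibv_get_async_event(ctx, &ev) == 0) {
        if (ev.event_type == IBV_EVENT_SRQ_LIMIT_REACHED) {
            post_more_recvs(srq, mr, pool, buf_size, 32); /* grow by a batch */
            arm_srq_limit(ev.element.srq, 16);            /* re-arm the watermark */
        }
        ibv_ack_async_event(&ev);
    }
}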


The XRC protocol seems to create shared receive queues, which is a good thing. However, comparing the memory used by an "X" queue versus an "S" queue, we can see a large difference.

So, do you see that X consumes more than S? This is really odd.

Digging a bit into the code, we found some strange things, like the completion queue size not being the same as for "S" queues (the patch below would fix it, but the root of the problem may be elsewhere).

Is anyone able to comment on this?
The fix looks OK; please submit it to trunk.
BTW, do you want to prepare the patch for the send queue size factor? It should be quite simple.
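For illustration only (this is not the patch in question), the kind of thing we mean by keeping the CQ sizing identical for "S" and "X" queues: size the CQ for every work request that can complete into it, with the same formula regardless of queue flavour. The helper name and parameters are invented.

#include <infiniband/verbs.h>

/* Hypothetical sketch: one CQ sizing formula shared by S (SRQ) and X (XRC)
 * queue types, capped by the device maximum.  Not Open MPI's actual code. */
static struct ibv_cq *create_btl_cq(struct ibv_context *ctx,
                                    struct ibv_comp_channel *chan,
                                    int recv_wr_per_qp, int send_wr_per_qp,
                                    int num_qps)
{
    /* Total outstanding work requests that can report into this CQ. */
    int cqe = (recv_wr_per_qp + send_wr_per_qp) * num_qps;

    struct ibv_device_attr dev_attr;
    if (ibv_query_device(ctx, &dev_attr) == 0 && cqe > dev_attr.max_cqe)
        cqe = dev_attr.max_cqe;   /* respect the HCA's CQ depth limit */

    return ibv_create_cq(ctx, cqe, NULL, chan, 0);
}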

Regards,
Pasha
