Jeff Squyres wrote:
On Mar 9, 2008, at 3:39 PM, Gleb Natapov wrote:

1. There was a discussion about this on the openfabrics mailing list and the conclusion was that what Open MPI does is correct according to the IB/iWarp spec.

2. Is it possible to fix your FW to follow the iWarp spec? Perhaps it is possible to implement ibv_post_recv() so that it will not return before the post receive is processed?

3. I personally don't like the idea of adding another layer of complexity to the openib BTL code just to work around HW that doesn't follow the spec. If the workaround is simple, that's OK, but in this case it is not so simple and would add a code path that is rarely tested. A simple workaround for the problem may be to not configure multiple QPs if the HW has a bug (and we can extend the ini file to contain this info).
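
(For context on point 2: the consumer-side call in question is ibv_post_recv(). Below is a minimal sketch against libibverbs -- the QP/MR/buffer setup is assumed to exist elsewhere, and post_one_recv is just an illustrative name, not OMPI code.)

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Post one receive buffer on a QP.  Per the verbs spec, once
     * ibv_post_recv() returns successfully the buffer should be able
     * to land incoming data; the issue being discussed is hardware
     * where that is not yet true at the time the call returns. */
    static int post_one_recv(struct ibv_qp *qp, struct ibv_mr *mr,
                             void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = (uintptr_t) buf,
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr = NULL;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }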


These are all valid points.

In thinking about Gleb's proposal a bit more (extending the INI file syntax to accept per-HCA receive_queues values), it might be only somewhat less efficient (and a lot less code) than sending all flow control messages on their respective qp's anyway. So let's explore the math...
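
For example, the per-device entry could look something like this (a hypothetical sketch only -- the section name and IDs are placeholders following the existing openib ini file conventions, and the receive_queues value syntax is assumed to mirror the btl_openib_receive_queues MCA parameter):

    [Chelsio T3]
    # vendor_id / vendor_part_id shown here are illustrative placeholders
    vendor_id = 0x1425
    vendor_part_id = ...
    # single per-peer QP: 256 buffers of the max send size
    receive_queues = P,65536,256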

The "let's use multiple QP's for short messages" scheme (a.k.a. BSRQ) was invented to get better registered memory utilization. Pushing all the FC messages down to the QP with the smallest buffer size was a desirable side-effect that made registered memory utilization even better (because short FC messages were naturally on the QP with the smallest buffer size). Specifically, today in openib/IB (SVN trunk), here's the default queue layout:

pp: 256 buffers of size 128
srq: 256 buffers of size 4k
srq: 256 buffers of size 12k (eager limit)
srq: 256 buffers of size 64k (max send size)

And then we add 4 more buffers on the pp qp for flow control messages (since we currently only send FC messages on pp qp's). Total registered memory for a job with 1 remote peer: (256+4)*128 + 256*4k + 256*12k + 256*64k = ~20M. This is somewhat deceiving, because the total registered memory scales slowly with the number of procs in the job (e.g., with 2 remote peers, it only increases by ~33k because we're using srq's).
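
Here's a throwaway sketch of that arithmetic (not OMPI code; the per-QP counts and sizes are just the defaults quoted above), showing how little the total grows as peers are added:

    #include <stdio.h>

    /* Back-of-the-envelope receive-buffer memory for the default BSRQ
     * layout quoted above: the pp QP (plus its 4 FC buffers) is
     * replicated per remote peer, while the three SRQs are shared. */
    int main(void)
    {
        const long pp_cost  = (256 + 4) * 128L;      /* per peer  */
        const long srq_cost = 256L * 4096 +          /* 4k SRQ    */
                              256L * 12288 +         /* 12k SRQ   */
                              256L * 65536;          /* 64k SRQ   */

        for (int peers = 1; peers <= 4; peers *= 2) {
            long total = peers * pp_cost + srq_cost;
            printf("%d peer(s): ~%.1f MB registered\n",
                   peers, total / (1024.0 * 1024.0));
        }
        return 0;
    }

(1 peer comes out at roughly 20 MB, and each additional peer adds only the ~33k pp cost.)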

With Gleb's proposal, you'd only have one pp qp, presumably with 64k buffers (or whatever the max send size is):

pp: 256 buffers of size 64k (max send size)

And then add 4 more for flow control messages. So total registered memory for a job with 1 remote peer: (256+4)*64k = ~17M. But that figure is approximately a per-peer cost -- so a job with 2 remote peers would use ~34M of registered memory, etc. This will [obviously] scale extremely poorly (and is one of the reasons that BSRQ exists).
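
Plugging the single-QP layout into the same arithmetic makes the scaling concrete (per peer, since there is no SRQ to amortize the cost):

    1 remote peer:    (256+4) * 64k  =  ~17 MB
    16 remote peers:                    ~272 MB
    64 remote peers:                    ~1.1 GB

i.e., registered memory grows linearly with the number of peers, which is exactly what BSRQ was designed to avoid.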

However, I wonder if there's a compromise (assuming you can't fix ibv_post_recv() to not return until the buffers are actually available, which, I agree with Gleb, seems like the best fix). Since we only use FC messages on pp qp's, why not make a "you can only have 1 pp qp and it must be qp 0" restriction for the Chelsio RNIC? This fits nicely into our default receive_queues value, anyway. That way, all FC messages will naturally go over qp 0 anyway (since that will be the only pp qp). Then, the only problem you have to solve is sending the *initial* credits message at wireup time (to know when the receive buffers have actually been posted to the srq's). Perhaps something like this:

1. you can export an attribute from the RNIC that advertises that ibv_post_recv() works this way (so that OMPI can detect it at run time)

2. hide the extra wireup / initial credit coordination in the rdma cm cpc when this attribute is detected (or add an mca param / ini file param that specifically requests -- or disables -- this extra rdma cm cpc behavior).
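
A rough sketch of what the run-time check in #1 could look like (illustrative only: there is no standard verbs attribute for this today, so keying off the device's vendor ID via ibv_query_device() stands in for whatever exported attribute or ini entry we'd actually use; 0x1425 is Chelsio's PCI vendor ID):

    #include <infiniband/verbs.h>

    /* Hypothetical helper: decide at run time whether this device needs
     * the extra "wait for initial credits" wireup step in the rdma cm
     * cpc.  The vendor-ID test is a placeholder for a real exported
     * attribute or an ini-file flag. */
    static int needs_delayed_credit_exchange(struct ibv_context *ctx)
    {
        struct ibv_device_attr attr;

        if (ibv_query_device(ctx, &attr) != 0) {
            return 0;   /* query failed: assume normal wireup */
        }
        return attr.vendor_id == 0x1425;
    }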

What would make this proposal moot is if the Chelsio RNIC can't do SRQs (I don't remember offhand). If it can't (and you can't fix ibv_post_recv()), then you might as well do Gleb's "just use one qp" proposal. You'll get lousy registered memory utilization, but the bigger problem you'll have is the scalability issue for large-peer-count jobs (e.g., using the values above, 17M of registered memory per peer; I assume you'll have to tune that down via .ini file params).

What about that?

This gen of the Chelsio RNIC doesn't support SRQs.

I don't think we can fix post_recv to behave like we want.

A single PP QP might be fine for now, and Chelsio's next-gen part will support SRQs and not have this funky issue.

But why use such a large buffer size for a single PP QP? Why not use something around 16KB?
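
(For comparison: with 16KB buffers the same (256+4)-buffer pp QP would register roughly 260 * 16k = ~4.2 MB per peer instead of ~17 MB, presumably at the cost of fragmenting anything larger than 16KB, since the max send size can't exceed the buffer size on that QP.)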


Steve.
