On Mar 9, 2008, at 3:39 PM, Gleb Natapov wrote:
> 1. There was a discussion about this on the openfabrics mailing list and
> the conclusion was that what Open MPI does is correct according to the
> IB/iWarp spec.
> 2. Is it possible to fix your FW to follow the iWarp spec? Perhaps it is
> possible to implement ibv_post_recv() so that it will not return before
> the post receive is processed?
> 3. I personally don't like the idea of adding another layer of complexity
> to the openib BTL code just to work around HW that doesn't follow the
> spec. If the workaround is simple that is OK, but in this case it is not
> so simple and will add a code path that is rarely tested. A simple
> workaround for the problem may be to not configure multiple QPs if the HW
> has a bug (and we can extend the ini file to contain this info).
These are all valid points.
In thinking about Gleb's proposal a bit more (extend the INI file
syntax to accept per-HCA receive_queues values), it might be only
somewhat less efficient (and a lot less code) than sending all flow
control messages on the respective qp's anyway. So let's explore the
math...
The "let's use multiple QP's for short messages" scheme (a.k.a. BSRQ)
was invented to get better registered memory utilization. Pushing all
the FC messages down to the QP with the smallest buffer size was a
desirable side-effect that made registered memory utilization even
better (because short FC messages were naturally on the QP with the
smallest buffer size). Specifically, today in openib/IB (SVN trunk),
here's the default queue layout:
pp: 256 buffers of size 128
srq: 256 buffers of size 4k
srq: 256 buffers of size 12k (eager limit)
srq: 256 buffers of size 64k (max send size)
And then we add 4 more buffers on the pp qp for flow control messages
(since we only currently send FC messages for pp qp's). Total
registered memory for a job with 1 remote peer: (256+4)*128 + 256*4k +
256*12k + 256*64k = ~20M. This is somewhat deceiving, because the
total registered memory scales slowly with the number of procs in the
job (e.g., with 2 remote peers, it only increases by ~33k because we're
using srq's).
With Gleb's proposal, you'd only have one pp qp, presumably with 64k buffers (or
whatever the max send size is):
pp: 256 buffers of size 64k (max send size)
And then add 4 more for flow control messages. So total registered
memory for a job with 1 remote peer: (256+4)*64k = ~17M. But that
figure is approximately a per-peer cost -- so a job with 2 remote
peers would use ~34M of registered memory, etc. This will [obviously]
scale extremely poorly (and is one of the reasons that BSRQ exists).
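(If anyone wants to play with the scaling, here's a quick throwaway C
snippet that just re-does the arithmetic above. The buffer counts and sizes
are the ones from the two layouts; nothing here is pulled from the actual
BTL code, and the totals land a hair off my ~20M / ~17M figures only
because of M-vs-MiB rounding.)

#include <stdio.h>

int main(void)
{
    /* BSRQ default layout: the three SRQs are shared across all peers;
       only the pp qp (256 + 4 FC buffers of 128 bytes) grows per peer. */
    const double bsrq_shared   = 256.0 * (4096 + 12288 + 65536);
    const double bsrq_per_peer = (256 + 4) * 128.0;

    /* Single pp qp at the max send size: everything is per-peer. */
    const double single_per_peer = (256 + 4) * 65536.0;

    const int peers[] = { 1, 2, 16, 256 };
    for (int i = 0; i < 4; i++) {
        int np = peers[i];
        printf("%3d peer(s): BSRQ ~%.1fM, single-qp ~%.1fM\n",
               np,
               (bsrq_shared + np * bsrq_per_peer) / 1e6,
               np * single_per_peer / 1e6);
    }
    return 0;
}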
However, I wonder if there's a compromise (assuming you can't fix
ibv_post_recv() to not return until the buffers are actually
available, which, I agree with Gleb, seems like the best fix). Since
we only use FC messages on pp qp's, why not make a "you can only have
1 pp qp and it must be qp 0" restriction for the Chelsio RNIC? This
fits nicely into our default receive_queues value, anyway. That way,
all FC messages will naturally go over qp 0 anyway (since that will be
the only pp qp). Then, the only problem you have to solve is sending
the *initial* credits message at wireup time (to know when the receive
buffers have actually been posted to the srq's). Perhaps something
like this:
1. you can export an attribute from the RNIC that advertises that
ibv_post_recv() works this way (so that OMPI can detect it at run time)
2. hide the extra wireup / initial credit coordination in the rdma cpc
when this attribute is detected, or make an mca param / ini file param
that specifically requests this extra rdma cm cpc behavior (or not).
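(To make #1 / #2 a bit more concrete, here's a rough sketch of what the
run-time decision could look like. This is not openib BTL code: the quirk
structure, the flag name, and the receive_queues spec-string syntax are all
hypothetical stand-ins for whatever attribute / ini field we would actually
define.)

#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-device quirk record, e.g. filled in from a new ini-file
   field keyed on the vendor_id / vendor_part_id that ibv_query_device()
   reports. */
struct device_quirks {
    uint32_t vendor_id;
    uint32_t vendor_part_id;
    int      delayed_post_recv;  /* 1 if ibv_post_recv() may return before
                                    the receive buffers are really posted */
};

/* If the quirk is set, restrict the layout to a single pp qp (which must
   be qp 0) so that all FC traffic naturally stays on it.  The spec strings
   below only mirror the layouts discussed above; the syntax is
   illustrative. */
static const char *choose_receive_queues(const struct device_quirks *q,
                                         const char *default_spec)
{
    return q->delayed_post_recv ? "P,65536,256" : default_spec;
}

int main(void)
{
    struct device_quirks q = { 0, 0, 1 };   /* placeholder vendor/part ids */
    printf("receive_queues = %s\n",
           choose_receive_queues(&q,
               "P,128,256:S,4096,256:S,12288,256:S,65536,256"));
    return 0;
}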
What would make this proposal moot is if the Chelsio RNIC can't do
SRQs (I don't remember offhand). If it can't (and you can't fix
ibv_post_recv()), then you might as well do Gleb's "just use one qp"
proposal. You'll get lousy registered memory utilization, but the
bigger problem you'll have is the scalability issues for large-peer-
count jobs (e.g., using the values above, 17M of registered memory per
peer; I assume you'll have to tune that down via .ini file params).
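FWIW, whether a device can do SRQs at all is something we can just ask the
verbs library at run time; here's a standalone sketch (compile with
-libverbs) of roughly what that check looks like:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (NULL == devs) {
        return 1;
    }

    for (int i = 0; i < n; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_device_attr attr;

        if (NULL != ctx && 0 == ibv_query_device(ctx, &attr)) {
            /* max_srq == 0 means the device cannot create SRQs at all */
            printf("%s: max_srq = %d -> %s\n",
                   ibv_get_device_name(devs[i]), attr.max_srq,
                   attr.max_srq > 0 ? "SRQs available" : "no SRQ support");
        }
        if (NULL != ctx) {
            ibv_close_device(ctx);
        }
    }

    ibv_free_device_list(devs);
    return 0;
}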
What about that?
--
Jeff Squyres
Cisco Systems