Re: [OMPI devel] calling sendi earlier in the PML

Eugene Loh Tue, 3 Mar 2009 15:32:07 -0500

Jeff Squyres wrote:

How about a compromise...
Keep a separate list somewhere of the sendi-enabled BTLs (this avoids looping over all the btl's and testing -- you can just loop over the btl's that you *know* have a sendi). Put that at the top of the PML and avoid the costly overhead, yadda yadda yadda.
But instead of having a static list of sendi-enabled BTLs, rotate them if there's >1. For example, say I have 3 sendi-enabled BTL modules: A, B, C.
In the first send, A->sendi() is used and it succeeds, so we shuffle the list and return. In the next send, B->sendi() is used and it succeeds, so we shuffle the list and return. In the next send, C->sendi() is used but it fails, so we shuffle the list and fall through to normal ->send() processing.
"shuffle the list" can be as simple as opal_list_remove_first() and opal_list_append() -- both of which should be O(1).
This should distribute the load across sendi-enabled BTLs, and if those ever get "overloaded" (such that sendi fails), we fall through to normal load-balanced PML sending.
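
For illustration only, the rotation described above might look like the following minimal C sketch. Everything in it is hypothetical: toy_btl_t, toy_list_t and the toy_* functions are stand-ins written for this example, not the real opal_list_t interface or BTL module types, and a sendi attempt is faked with a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sendi-enabled BTL module (not an Open MPI type). */
typedef struct toy_btl {
    const char *name;
    int sendi_ok;              /* 1: a sendi attempt would succeed; 0: it would fail */
    struct toy_btl *next;
} toy_btl_t;

/* Hypothetical list with head and tail pointers so both operations are O(1). */
typedef struct {
    toy_btl_t *head;
    toy_btl_t *tail;
} toy_list_t;

/* Append at the tail in O(1). */
static void toy_list_append(toy_list_t *l, toy_btl_t *b)
{
    assert(l != NULL && b != NULL);
    b->next = NULL;
    if (l->tail != NULL) {
        l->tail->next = b;
    } else {
        l->head = b;
    }
    l->tail = b;
}

/* Remove and return the head in O(1); NULL if the list is empty. */
static toy_btl_t *toy_list_remove_first(toy_list_t *l)
{
    toy_btl_t *b;
    assert(l != NULL);
    b = l->head;
    if (b != NULL) {
        l->head = b->next;
        if (l->head == NULL) {
            l->tail = NULL;
        }
        b->next = NULL;
    }
    return b;
}

/* Try the BTL at the front, then rotate it to the back whether or not
 * the attempt succeeded. Returns the BTL used on success, or NULL to
 * tell the caller to fall through to the normal send path. */
static toy_btl_t *toy_try_sendi(toy_list_t *l)
{
    toy_btl_t *b = toy_list_remove_first(l);
    if (b == NULL) {
        return NULL;
    }
    toy_list_append(l, b);
    return b->sendi_ok ? b : NULL;
}
```

With three modules A, B, C where only C fails, three successive calls would use A, use B, and then return NULL for C, leaving the list back in its original order.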

First, this behavior is basically what I was proposing and what George didn't feel comfortable with. It is arguably no compromise at all. (Uggh, why must I be so honest?) For eager messages, it favors BTLs with sendi functions, which could lead to those BTLs becoming overloaded. I think favoring BTLs with sendi for short messages is good. George thinks that load balancing BTLs is good.


Second, the implementation can be simpler than you suggest:

*) You don't need a separate list since testing for a sendi-enabled BTL is relatively cheap (I think... could verify).

*) You don't need to shuffle the list. The mechanism used by ob1 just resumes the BTL search from the last BTL used. E.g., check https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/pml/ob1/pml_ob1_sendreq.h#mca_pml_ob1_send_request_start. You use mca_bml_base_btl_array_get_next(&btl_eager) to round-robin over BTLs in a totally fair manner (remembering where the last loop left off), and use mca_bml_base_btl_array_get_size(&btl_eager) to make sure you don't loop endlessly.
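
As a rough sketch of that second point, here is a toy model in C of a round-robin search that resumes where the previous one left off and is bounded by the array size. The rr_* names and types are hypothetical and exist only for this example; they are not the mca_bml_base_btl_array_* functions themselves, whose internals I am not reproducing here.

```c
#include <stddef.h>

/* Hypothetical stand-in for a BTL (not an Open MPI type). */
typedef struct {
    int id;
    int has_sendi;             /* nonzero if this BTL provides a sendi function */
} rr_btl_t;

/* Hypothetical BTL array that remembers where the last search stopped. */
typedef struct {
    rr_btl_t *btls;
    size_t size;
    size_t next;               /* index at which the next search resumes */
} rr_btl_array_t;

static size_t rr_array_get_size(const rr_btl_array_t *a)
{
    return a->size;
}

/* Return the BTL at the current position and advance, wrapping around. */
static rr_btl_t *rr_array_get_next(rr_btl_array_t *a)
{
    rr_btl_t *b;
    if (a->size == 0) {
        return NULL;
    }
    b = &a->btls[a->next];
    a->next = (a->next + 1) % a->size;
    return b;
}

/* Look at each BTL at most once, starting after the last one used.
 * Returns the first sendi-enabled BTL found, or NULL if none has sendi,
 * in which case the caller would use the traditional send path. */
static rr_btl_t *rr_pick_sendi(rr_btl_array_t *a)
{
    size_t i;
    size_t n = rr_array_get_size(a);
    for (i = 0; i < n; i++) {
        rr_btl_t *b = rr_array_get_next(a);
        if (b->has_sendi) {
            return b;
        }
    }
    return NULL;
}
```

No separate list and no shuffling are needed in this model: the saved index gives the fairness, and the size bound guarantees the loop ends even when no BTL has a sendi.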

I've been toying with two implementations. The one I described in San Jose was called FAST, so let's still call it that. It tests for sendi early in the PML, calling traditional send only if no sendi is found for any BTL. To preserve the BTL ordering George favors (always round-robinning over BTLs, looking only secondarily for sendi), I tried another implementation I'll call FAIR. It attempts to initialize the send request only very minimally. One still makes a number of function calls and goes "deep" into the PML, but defers as much send-request initialization as possible until as late as possible. I can't promise that both implementations FAST and FAIR are equally rock solid or optimized, but this is where I am so far. The differences are:


*) FAST involves far fewer code changes.

*) FAST produces faster latencies. E.g., for 0-byte OSU latencies, FAST is 8-10% better than OMPI while FAIR is only 1-3% (or 2-3%... something like that). (The improvements I showed in San Jose for FAST were more dramatic than 8-10%, but that's because there were optimizations on the receive side and in the data convertors as well. For the e-mail you're reading right now, I'm talking just about send-request optimizations.)

*) Theoretically, FAIR is broader reaching. E.g., if persistent sends can always use a sendi path, they will all potentially benefit. (This is theory. I haven't actually observed such a speed-up yet and it might just end up getting lost in the noise.)
