Jeff Squyres wrote:
How about a compromise...
Keep a separate list somewhere of the sendi-enabled BTLs (this avoids
looping over all the BTLs and testing -- you can just loop over the
BTLs that you *know* have a sendi). Put that at the top of the PML
and avoid the costly overhead, yadda yadda yadda.
But instead of having a static list of sendi-enabled BTLs, rotate
them if there's >1. For example, say I have 3 sendi-enabled BTL
modules: A, B, C.
In the first send, A->sendi() is used and it succeeds, so we shuffle
the list and return.
In the next send, B->sendi() is used and it succeeds, so we shuffle
the list and return.
In the next send, C->sendi() is used but it fails, so we shuffle the
list and fall through to normal ->send() processing.
"shuffle the list" can be as simple as opal_list_remove_first() and
opal_list_append() -- both of which should be O(1).
This should distribute the load across sendi-enabled BTLs, and if
those ever get "overloaded" (such that sendi fails), we fall through
to normal load-balanced PML sending.
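
For concreteness, here is a minimal sketch of the rotation Jeff describes. It is not Open MPI code: btl_t, btl_list_t, list_remove_first(), list_append() and try_sendi_rotating() are hypothetical stand-ins (the real proposal would use opal_list_remove_first() and opal_list_append()), and sendi is reduced to a function that returns 0 on success.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sendi-enabled BTL module (not the real type). */
typedef struct btl {
    const char *name;
    int (*sendi)(struct btl *self);  /* 0 on success, nonzero on failure */
    struct btl *next;
} btl_t;

/* Hypothetical singly linked queue with O(1) remove-first and append. */
typedef struct {
    btl_t *head;
    btl_t *tail;
} btl_list_t;

static btl_t *list_remove_first(btl_list_t *l)
{
    btl_t *b = l->head;
    if (b != NULL) {
        l->head = b->next;
        if (l->head == NULL) {
            l->tail = NULL;
        }
        b->next = NULL;
    }
    return b;
}

static void list_append(btl_list_t *l, btl_t *b)
{
    assert(b != NULL);
    b->next = NULL;
    if (l->tail != NULL) {
        l->tail->next = b;
    } else {
        l->head = b;
    }
    l->tail = b;
}

/* Try sendi on the first BTL and rotate the list whether or not it succeeds.
 * Returns the BTL that sent the message, or NULL if the caller should fall
 * through to normal ->send() processing. */
static btl_t *try_sendi_rotating(btl_list_t *l)
{
    btl_t *b = list_remove_first(l);
    if (b == NULL) {
        return NULL;            /* no sendi-enabled BTLs at all */
    }
    list_append(l, b);          /* "shuffle the list" */
    return (b->sendi(b) == 0) ? b : NULL;
}

/* Toy sendi functions used only to exercise the sketch. */
static int sendi_ok(btl_t *self)   { (void)self; return 0; }
static int sendi_fail(btl_t *self) { (void)self; return -1; }
```

With A and B using sendi_ok and C using sendi_fail, three consecutive calls pick A, then B, then return NULL for C, which is the sequence in the example above.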
First, this behavior is basically what I was proposing and what George
didn't feel comfortable with. It is arguably no compromise at all.
(Uggh, why must I be so honest?) For eager messages, it favors BTLs
with sendi functions, which could lead to those BTLs becoming
overloaded. I think favoring BTLs with sendi for short messages is
good. George thinks that load balancing BTLs is good.
Second, the implementation can be simpler than you suggest:
*) You don't need a separate list since testing for a sendi-enabled BTL
is relatively cheap (I think... could verify).
*) You don't need to shuffle the list. The mechanism used by ob1 just
resumes the BTL search from the last BTL used. E.g., see
mca_pml_ob1_send_request_start in
https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/pml/ob1/pml_ob1_sendreq.h#mca_pml_ob1_send_request_start
You use mca_bml_base_btl_array_get_next(&btl_eager) to round-robin
over BTLs in a totally fair manner (remembering where the last loop left
off), and mca_bml_base_btl_array_get_size(&btl_eager) to make sure
you don't loop endlessly.
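
Roughly, the loop I have in mind looks like the sketch below. This is a self-contained illustration, not the ob1 source: btl_t, btl_array_t, array_get_next() and find_sendi() are hypothetical stand-ins that only mimic what I understand mca_bml_base_btl_array_get_next() and mca_bml_base_btl_array_get_size() to do, and "sendi succeeds" is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one eager BTL. */
typedef struct {
    int has_sendi;  /* nonzero if this BTL provides a sendi function */
    int sendi_ok;   /* nonzero if its sendi would succeed right now */
} btl_t;

/* Hypothetical stand-in for the BTL array; index remembers where the
 * previous search left off. */
typedef struct {
    btl_t *btls;
    size_t size;
    size_t index;
} btl_array_t;

/* Return the BTL at the cursor and advance the cursor, wrapping around. */
static btl_t *array_get_next(btl_array_t *a)
{
    assert(a->size > 0);
    btl_t *b = &a->btls[a->index];
    a->index = (a->index + 1) % a->size;
    return b;
}

/* Visit each BTL at most once, starting where the last call stopped.
 * Returns the position of the BTL whose sendi succeeded, or -1 if the
 * caller should fall back to the normal send path. */
static int find_sendi(btl_array_t *a)
{
    for (size_t i = 0; i < a->size; i++) {  /* bounded by the array size */
        btl_t *b = array_get_next(a);
        if (b->has_sendi && b->sendi_ok) {
            return (int)(b - a->btls);
        }
    }
    return -1;
}
```

No separate list and no shuffling are needed: the cursor provides the rotation, and the size bound guarantees termination when no sendi succeeds.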
I've been toying with two implementations. The one I described in San
Jose was called FAST, so let's still call it that. It tests for sendi
early in the PML, calling traditional send only if no sendi is found for
any BTL. To preserve the BTL ordering George favors (always
round-robining over BTLs, looking only secondarily for sendi), I tried
another implementation I'll call FAIR. It attempts to initialize the
send request only very minimally. One still makes a number of function
calls and goes "deep" into the PML, but defers as much send-request
initialization as possible for as long as possible. I can't promise that both
implementations FAST and FAIR are equally rock solid or optimized, but
this is where I am so far. The differences are:
*) FAST involves far fewer code changes.
*) FAST produces lower latencies. E.g., for 0-byte OSU latencies, FAST
improves on OMPI by 8-10% while FAIR improves by only 1-3% (or 2-3%...
something like that). (The improvements I showed in San Jose for FAST were more
dramatic than 8-10%, but that's because there were optimizations on the
receive side and in the data convertors as well. For the e-mail you're
reading right now, I'm talking just about send-request optimizations.)
*) Theoretically, FAIR is broader reaching. E.g., if persistent sends
can always use a sendi path, they will all potentially benefit. (This
is theory. I haven't actually observed such a speed-up yet and it might
just end up getting lost in the noise.)
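
To make the difference in BTL selection concrete, here is a toy sketch of the two policies. It is only an illustration under simplifying assumptions: btls_t, choice_t, pick_fast() and pick_fair() are hypothetical names, a sendi that exists is assumed to succeed, and none of the send-request initialization that distinguishes the real implementations is modeled.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { PATH_SENDI, PATH_SEND } path_t;

/* Hypothetical view of the eager BTLs for one peer. */
typedef struct {
    const int *has_sendi;  /* has_sendi[i] is nonzero if BTL i offers sendi */
    size_t size;
    size_t next;           /* round-robin cursor */
} btls_t;

typedef struct {
    size_t btl;            /* index of the chosen BTL */
    path_t path;           /* which send path is taken on it */
} choice_t;

/* FAST-like policy: look for any BTL with sendi first; use the
 * traditional send path only if no BTL has one. */
static choice_t pick_fast(btls_t *b)
{
    choice_t c;
    assert(b->size > 0);
    for (size_t i = 0; i < b->size; i++) {
        size_t k = (b->next + i) % b->size;
        if (b->has_sendi[k]) {
            b->next = (k + 1) % b->size;
            c.btl = k;
            c.path = PATH_SENDI;
            return c;
        }
    }
    c.btl = b->next;
    c.path = PATH_SEND;
    b->next = (b->next + 1) % b->size;
    return c;
}

/* FAIR-like policy: round-robin picks the BTL first; sendi is used
 * only if that particular BTL happens to have one. */
static choice_t pick_fair(btls_t *b)
{
    choice_t c;
    assert(b->size > 0);
    c.btl = b->next;
    c.path = b->has_sendi[c.btl] ? PATH_SENDI : PATH_SEND;
    b->next = (b->next + 1) % b->size;
    return c;
}
```

With three BTLs of which only the second has sendi, pick_fast() sends every eager message through the second BTL, while pick_fair() cycles through all three and uses sendi on one send in three. That is the load-balancing trade-off discussed above.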