How about a compromise...
Keep a separate list somewhere of the sendi-enabled BTLs (this avoids
looping over all the BTLs and testing each one -- you can just loop
over the BTLs that you *know* have a sendi). Put that at the top of
the PML and avoid the costly overhead, yadda yadda yadda.
But instead of having a static list of sendi-enabled BTLs, rotate them
if there's >1. For example, say I have 3 sendi-enabled BTL modules:
A, B, C.
In the first send, A->sendi() is used and it succeeds, so we shuffle
the list and return.
In the next send, B->sendi() is used and it succeeds, so we shuffle
the list and return.
In the next send, C->sendi() is used but it fails, so we shuffle the
list and fall through to normal ->send() processing.
"shuffle the list" can be as simple as opal_list_remove_first() and
opal_list_append() -- both of which should be O(1).
This should distribute the load across sendi-enabled BTLs, and if
those ever get "overloaded" (such that sendi fails), we fall through
to normal load-balanced PML sending.
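Here's a rough, standalone sketch of what I mean. The btl_t type and
the names below are made up just for illustration -- in the real code
the separate list would be an opal_list_t and the "shuffle" would be
the O(1) remove_first/append described above:

    /* Toy model of the rotating sendi list -- not real PML code. */
    #include <stddef.h>

    #define SUCCESS 0
    #define ERROR  -1

    typedef struct btl {
        int (*sendi)(struct btl *self, const void *buf, size_t len);
        int (*send)(struct btl *self, const void *buf, size_t len);
    } btl_t;

    /* Separate list of the BTLs known to have a sendi. */
    static btl_t *sendi_btls[8];
    static size_t num_sendi_btls = 0;
    static size_t sendi_head = 0;   /* which sendi BTL to try next */

    int pml_send(const void *buf, size_t len,
                 btl_t **all_btls, size_t nbtls)
    {
        size_t i;

        if (num_sendi_btls > 0) {
            btl_t *btl = sendi_btls[sendi_head];
            /* "Shuffle": advance regardless of the outcome so the load
               rotates across the sendi-enabled BTLs; O(1), just like
               opal_list_remove_first() + opal_list_append(). */
            sendi_head = (sendi_head + 1) % num_sendi_btls;
            if (SUCCESS == btl->sendi(btl, buf, len)) {
                return SUCCESS;   /* fast path, no send request built */
            }
            /* sendi failed (e.g. that BTL is overloaded): fall through
               to normal ->send() processing below. */
        }

        for (i = 0; i < nbtls; ++i) {
            if (SUCCESS == all_btls[i]->send(all_btls[i], buf, len)) {
                return SUCCESS;
            }
        }
        return ERROR;
    }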
Howzat?
On Mar 2, 2009, at 1:37 PM, Eugene Loh wrote:
I'm on the verge of giving up on moving the sendi call up in the PML.
I will try one or two last things, including this e-mail asking for
feedback.
The idea is that when a BTL goes over a very low-latency
interconnect (like sm), we really want to shave off whatever we can
from the software stack. One way of doing so is to use a "send-
immediate" function, which a few BTLs (like sm) provide. The
problem is avoiding a bunch of overhead introduced by the PML before
checking for a "sendi()" call.
Currently, the PML does something like this:
    for ( btl = ... ) {
        if ( SUCCESS == btl->sendi() ) return SUCCESS;
        if ( SUCCESS == btl->send() ) return SUCCESS;
    }
    return ERROR;
That is, it round-robins over all available BTLs, trying sendi() and
then send() for each one. If any sendi or send completes
successfully, we exit the loop successfully.
The problem is that this loop is buried several function calls deep
in the PML. Before it reaches this far, the PML has initialized a
large "send request" data structure while traversing some (to me)
complicated call graph of functions. This introduces a lot of
overhead that erodes much of the speedup we might hope to see with
the sendi function. That overhead is unnecessary for a sendi call,
but necessary for a send call. I've tried reorganizing the code to
defer as much of that work as possible -- performing that overhead
only when it's needed for a send call -- but I've gotten a brain
cramp every time I've tried this reorganization.
I think these are the options:
Option A) Punt!
Option B) Have someone more familiar with the PML make these changes.
Option C) Have Eugene keep working at this because he'll learn more
about the PML and it's good for his character.
Option D) Go to a strategy in which all BTLs are tried for sendi
before any of them is tried for a send. The code would look like
this:
    for ( btl = ... ) if ( SUCCESS == btl->sendi() ) return SUCCESS;
    for ( btl = ... ) if ( SUCCESS == btl->send() ) return SUCCESS;
    return ERROR;
The reason this is so much easier to achieve is that we can put that
first loop way up high in the PML (as soon as a send enters the PML,
avoiding all that expensive overhead) and leave the second loop
several layers down, where it is today. George is against this new
loop structure because he thinks round-robin selection of BTLs is the
fairest and distributes the load over the BTLs as evenly as possible.
(In contrast, the proposed loop would favor BTLs with sendi
functions.) It seems to me, however, that favoring BTLs that have
sendi functions is exactly the right thing to do! I'm not even
convinced that the conditions he's worried about are that common:
multiple eager BTLs to poll, one has a sendi, and that sendi is not
very good or that BTL is getting overloaded.
Anyhow, I like Option D, but George does not.
Option E) Go to a strategy in which the next BTL is tested for a
sendi function. If it has one, use it. If not, just continue with
the usual heavyweight PML procedure. This feels a little hackish to
me, but it would mean that most of the time that sendi can be called,
the heavyweight PML overhead is avoided, while "fair" round-robin
polling over the BTLs is maintained.
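For what it's worth, here is a toy sketch of what Option E could look
like (made-up names and a stand-in for the existing deep PML path --
just a sketch, not a proposed implementation):

    #include <stddef.h>

    #define SUCCESS 0

    typedef struct btl {
        /* NULL if this BTL has no send-immediate function. */
        int (*sendi)(struct btl *self, const void *buf, size_t len);
    } btl_t;

    /* Stand-in for the existing heavyweight path: send request setup
       plus the round-robin sendi/send loop several layers down. */
    int pml_heavyweight_send(const void *buf, size_t len);

    int pml_send_option_e(const void *buf, size_t len,
                          btl_t **btls, size_t nbtls, size_t *next_btl)
    {
        btl_t *btl = btls[*next_btl];

        if (NULL != btl->sendi) {
            /* Consume this BTL's round-robin turn to stay "fair". */
            *next_btl = (*next_btl + 1) % nbtls;
            if (SUCCESS == btl->sendi(btl, buf, len)) {
                return SUCCESS;   /* heavyweight overhead avoided */
            }
        }
        /* No sendi on the next BTL (or it failed): usual procedure. */
        return pml_heavyweight_send(buf, len);
    }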
I'll proceed with Option C for the time being. If I don't announce
success or surrender in the next few days, please write to me at the
insane asylum.
--
Jeff Squyres
Cisco Systems