Re: [OMPI devel] calling sendi earlier in the PML

Jeff Squyres Tue, 3 Mar 2009 13:09:35 -0500

How about a compromise...

Keep a separate list somewhere of the sendi-enabled BTLs (this avoids looping over all the BTLs and testing -- you can just loop over the BTLs that you *know* have a sendi). Put that at the top of the PML and avoid the costly overhead, yadda yadda yadda.

But instead of having a static list of sendi-enabled BTLs, rotate them if there's >1. For example, say I have 3 sendi-enabled BTL modules: A, B, C.

In the first send, A->sendi() is used and it succeeds, so we shuffle the list and return.
In the next send, B->sendi() is used and it succeeds, so we shuffle the list and return.
In the next send, C->sendi() is used but it fails, so we shuffle the list and fall through to normal ->send() processing.

"Shuffle the list" can be as simple as opal_list_remove_first() and opal_list_append() -- both of which should be O(1).

This should distribute the load across sendi-enabled BTLs, and if those ever get "overloaded" (such that sendi fails), we fall through to normal load-balanced PML sending.
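To make the rotation concrete, here is a minimal standalone sketch in C. It does not use the real Open MPI types or opal_list; the `ring_*` names, the fixed-size array, and the stub sendi functions are all hypothetical. Instead of remove-first/append on a list, it advances a head index, which gives the same O(1) rotation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sendi-capable BTL module (not the real type). */
typedef struct ring_btl {
    const char *name;
    int (*sendi)(struct ring_btl *self); /* 0 on success, nonzero on failure */
    int sendi_calls;                     /* bookkeeping for the example only */
} ring_btl_t;

#define RING_MAX_BTLS 8

/* The separate list of sendi-enabled BTLs; head is the one to try next. */
typedef struct {
    ring_btl_t *items[RING_MAX_BTLS];
    size_t count;
    size_t head;
} ring_list_t;

void ring_add(ring_list_t *ring, ring_btl_t *btl)
{
    assert(ring->count < RING_MAX_BTLS);
    ring->items[ring->count++] = btl;
}

/* Try exactly one sendi per send and rotate whether or not it succeeds.
 * Returns 1 if sendi handled the message, 0 if the caller should fall
 * through to the normal send path. */
int ring_try_sendi(ring_list_t *ring)
{
    if (ring->count == 0) {
        return 0;
    }
    ring_btl_t *btl = ring->items[ring->head];
    ring->head = (ring->head + 1) % ring->count; /* O(1) "shuffle" */
    btl->sendi_calls++;
    return btl->sendi(btl) == 0;
}

/* Stub sendi implementations used only to exercise the rotation. */
int ring_sendi_ok(ring_btl_t *self)   { (void)self; return 0; }
int ring_sendi_busy(ring_btl_t *self) { (void)self; return -1; }
```

With A and B using ring_sendi_ok and C using ring_sendi_busy, three consecutive calls to ring_try_sendi() hit A, B, and C in turn; the third returns 0, which is where the normal ->send() processing would take over.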


Howzat?



On Mar 2, 2009, at 1:37 PM, Eugene Loh wrote:

I'm on the verge of giving up moving the sendi call in the PML. I will try one or two last things, including this e-mail asking for feedback.
The idea is that when a BTL goes over a very low-latency interconnect (like sm), we really want to shave off whatever we can from the software stack. One way of doing so is to use a "send-immediate" function, which a few BTLs (like sm) provide. The problem is avoiding a bunch of overhead introduced by the PML before checking for a "sendi()" call.
Currently, the PML does something like this:

  for ( btl = ... ) {
      if ( SUCCESS == btl->sendi() ) return SUCCESS;
      if ( SUCCESS == btl->send() ) return SUCCESS;
  }
  return ERROR;
That is, it round-robins over all available BTLs, for each one trying sendi() and then send(). If ever a sendi or send completes successfully, we exit the loop successfully.
The problem is that this loop is buried several function calls deep in the PML. Before it reaches this far, the PML has initialized a large "send request" data structure while traversing some (to me) complicated call graph of functions. This introduces a lot of overhead that mitigates much of the speedup we might hope to see with the sendi function. That overhead is unnecessary for a sendi call, but necessary for a send call. I've tried reorganizing the code to defer as much of that work as possible -- performing that overhead only if it's needed to perform a send call -- but I've gotten brain cramp every time I've tried this reorganization.
I think these are the options:

Option A) Punt!

Option B) Have someone more familiar with the PML make these changes.
Option C) Have Eugene keep working at this because he'll learn more about the PML and it's good for his character.
Option D) Go to a strategy in which all BTLs are tried for sendi before any of them is tried for a send. The code would look like this:
  for ( btl = ... ) if ( SUCCESS == btl->sendi() ) return SUCCESS;
  for ( btl = ... ) if ( SUCCESS == btl->send() ) return SUCCESS;
  return ERROR;
The reason this is so much easier to achieve is that we can put that first loop way up high in the PML (as soon as a send enters the PML, avoiding all that expensive overhead) and leave the second loop several layers down, where it is today. George is against this new loop structure because he thinks round-robin selection of BTLs is most fair and distributes the load over BTLs as evenly as possible. (In contrast, the proposed loop would favor BTLs with sendi functions.) It seems to me, however, that favoring BTLs that have sendi functions is exactly the right thing to do! I'm not even convinced that the conditions he's worried about are that common: multiple eager BTLs to poll, one has a sendi, and that sendi is not very good or that BTL is getting overloaded.
Anyhow, I like Option D, but George does not.
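As a rough illustration of Option D, here is a standalone C sketch. Everything in it is hypothetical (the `optd_*` names, the stub functions, and the `requests_built` counter that stands in for the expensive send-request setup); it is not the PML code, only the two-loop shape described above.

```c
#include <assert.h>
#include <stddef.h>

enum { OPTD_SUCCESS = 0, OPTD_ERROR = -1 };

/* Hypothetical BTL: sendi is NULL when the BTL has no send-immediate. */
typedef struct optd_btl {
    int (*sendi)(struct optd_btl *self);
    int (*send)(struct optd_btl *self);
} optd_btl_t;

/* First loop: could sit at the top of the PML, before any request setup. */
int optd_fast_path(optd_btl_t **btls, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (btls[i]->sendi != NULL && btls[i]->sendi(btls[i]) == OPTD_SUCCESS) {
            return OPTD_SUCCESS;
        }
    }
    return OPTD_ERROR;
}

/* Second loop: stays deep in the PML, after the heavyweight setup. */
int optd_slow_path(optd_btl_t **btls, size_t n, int *requests_built)
{
    (*requests_built)++; /* stands in for initializing the send request */
    for (size_t i = 0; i < n; i++) {
        if (btls[i]->send(btls[i]) == OPTD_SUCCESS) {
            return OPTD_SUCCESS;
        }
    }
    return OPTD_ERROR;
}

/* Option D: every sendi is tried before any send is tried. */
int optd_send(optd_btl_t **btls, size_t n, int *requests_built)
{
    if (optd_fast_path(btls, n) == OPTD_SUCCESS) {
        return OPTD_SUCCESS;
    }
    return optd_slow_path(btls, n, requests_built);
}

/* Stubs for exercising the sketch. */
int optd_sendi_ok(optd_btl_t *self)   { (void)self; return OPTD_SUCCESS; }
int optd_sendi_fail(optd_btl_t *self) { (void)self; return OPTD_ERROR; }
int optd_send_ok(optd_btl_t *self)    { (void)self; return OPTD_SUCCESS; }
```

Note that the fast path always scans from the first BTL, so a BTL with a working sendi is picked regardless of whose turn it would have been; that is the bias George objects to.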
Option E) Go to a strategy in which the next BTL is tested for a sendi function. If there is one, use it. If not, just continue with the usual heavyweight PML procedure. This feels a little hackish to me, but it'll mean that most of the time that sendi can be called, the heavyweight PML overhead will be avoided, while at the same time "fair" round-robin polling over the BTLs is maintained.
I'll proceed with Option C for the time being. If I don't announce success or surrender in the next few days, please write to me at the insane asylum.
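Option E might look roughly like the following standalone C sketch. The `opte_*` names, the `next` index, and the `requests_built` counter are hypothetical stand-ins, and the heavyweight path is simplified to a plain round-robin send() loop; the point is only that the sendi check looks at the one BTL whose turn it is.

```c
#include <assert.h>
#include <stddef.h>

enum { OPTE_SUCCESS = 0, OPTE_ERROR = -1 };

/* Hypothetical BTL: sendi is NULL when the BTL has no send-immediate. */
typedef struct opte_btl {
    int (*sendi)(struct opte_btl *self);
    int (*send)(struct opte_btl *self);
} opte_btl_t;

/* Hypothetical PML state: the BTL array plus the round-robin position. */
typedef struct {
    opte_btl_t **btls;
    size_t n;
    size_t next;        /* index of the BTL whose turn it is */
    int requests_built; /* counts heavyweight request setups */
} opte_pml_t;

int opte_send(opte_pml_t *pml)
{
    assert(pml->n > 0);

    /* Peek only at the next BTL in round-robin order. */
    opte_btl_t *btl = pml->btls[pml->next];
    if (btl->sendi != NULL && btl->sendi(btl) == OPTE_SUCCESS) {
        pml->next = (pml->next + 1) % pml->n;
        return OPTE_SUCCESS; /* no send request was ever built */
    }

    /* Otherwise continue with the usual heavyweight procedure. */
    pml->requests_built++;
    for (size_t tried = 0; tried < pml->n; tried++) {
        btl = pml->btls[pml->next];
        pml->next = (pml->next + 1) % pml->n;
        if (btl->send(btl) == OPTE_SUCCESS) {
            return OPTE_SUCCESS;
        }
    }
    return OPTE_ERROR;
}

/* Stubs for exercising the sketch. */
int opte_sendi_ok(opte_btl_t *self) { (void)self; return OPTE_SUCCESS; }
int opte_send_ok(opte_btl_t *self)  { (void)self; return OPTE_SUCCESS; }
```

With one sendi-capable BTL and one without, sends alternate between the two: the first skips the request setup entirely and the second pays for it, so the round-robin order is left undisturbed.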
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems
