How about a compromise...

Keep a separate list somewhere of the sendi-enabled BTLs (this avoids looping over all the BTLs and testing each one -- you can just loop over the BTLs that you *know* have a sendi). Put that check at the top of the PML and avoid the costly overhead, yadda yadda yadda.

But instead of having a static list of sendi-enabled BTLs, rotate them if there's >1. For example, say I have 3 sendi-enabled BTL modules: A, B, C.

In the first send, A->sendi() is used and it succeeds, so we shuffle the list and return. In the next send, B->sendi() is used and it succeeds, so we shuffle the list and return. In the next send, C->sendi() is used but it fails, so we shuffle the list and fall through to normal ->send() processing.

"Shuffle the list" (really, rotate it) can be as simple as opal_list_remove_first() followed by opal_list_append() -- both of which should be O(1).

This should distribute the load across sendi-enabled BTLs, and if those ever get "overloaded" (such that sendi fails), we fall through to normal load-balanced PML sending.
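
Here is a rough, self-contained sketch of the rotation idea. It is not PML code: sendi_btl_t, sendi_list_t, try_sendi_rotating() and the demo_* stubs are hypothetical names made up for illustration, and a tiny hand-rolled queue stands in for opal_list_t so the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a BTL module that provides sendi(). */
typedef struct sendi_btl {
    const char *name;
    bool (*sendi)(const void *buf, size_t len);   /* true on success */
    struct sendi_btl *next;
} sendi_btl_t;

/* Queue with head and tail pointers so both operations below are O(1). */
typedef struct {
    sendi_btl_t *head;
    sendi_btl_t *tail;
} sendi_list_t;

static void sendi_list_append(sendi_list_t *l, sendi_btl_t *b)
{
    assert(l != NULL && b != NULL);
    b->next = NULL;
    if (l->tail != NULL) l->tail->next = b;
    else                 l->head = b;
    l->tail = b;
}

static sendi_btl_t *sendi_list_remove_first(sendi_list_t *l)
{
    assert(l != NULL);
    sendi_btl_t *b = l->head;
    if (b != NULL) {
        l->head = b->next;
        if (l->head == NULL) l->tail = NULL;
        b->next = NULL;
    }
    return b;
}

/* Try only the BTL at the front of the list, and rotate it to the back
 * whether or not its sendi() succeeds. A false return means the caller
 * should fall through to the normal ->send() processing. */
static bool try_sendi_rotating(sendi_list_t *l, const void *buf, size_t len)
{
    sendi_btl_t *b = sendi_list_remove_first(l);
    if (b == NULL) return false;        /* no sendi-enabled BTLs at all */
    sendi_list_append(l, b);            /* the "shuffle" */
    return b->sendi(buf, len);
}

/* Demo stubs: one sendi that always succeeds, one that is "overloaded". */
static bool demo_sendi_ok(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return true;
}

static bool demo_sendi_busy(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return false;
}
```

With A and B using demo_sendi_ok and C using demo_sendi_busy, three consecutive calls behave like the A, B, C walk-through above: two fast-path successes, then one failure that drops to the normal send path, with the list back in its original order.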

Howzat?



On Mar 2, 2009, at 1:37 PM, Eugene Loh wrote:

I'm on the verge of giving up moving the sendi call in the PML. I will try one or two last things, including this e-mail asking for feedback.

The idea is that when a BTL goes over a very low-latency interconnect (like sm), we really want to shave off whatever we can from the software stack. One way of doing so is to use a "send-immediate" function, which a few BTLs (like sm) provide. The problem is avoiding a bunch of overhead introduced by the PML before checking for a "sendi()" call.

Currently, the PML does something like this:

  for ( btl = ... ) {
      if ( SUCCESS == btl->sendi() ) return SUCCESS;
      if ( SUCCESS == btl->send() ) return SUCCESS;
  }
  return ERROR;

That is, it round-robins over all available BTLs, trying sendi() and then send() on each one. As soon as a sendi or send completes successfully, we exit the loop successfully.

The problem is that this loop is buried several function calls deep in the PML. Before it reaches this far, the PML has initialized a large "send request" data structure while traversing some (to me) complicated call graph of functions. This introduces a lot of overhead that negates much of the speedup we might hope to see with the sendi function. That overhead is unnecessary for a sendi call, but necessary for a send call. I've tried reorganizing the code to defer as much of that work as possible -- incurring that overhead only if it's needed to perform a send call -- but I've gotten brain cramp every time I've tried this reorganization.

I think these are the options:

Option A) Punt!

Option B) Have someone more familiar with the PML make these changes.

Option C) Have Eugene keep working at this because he'll learn more about the PML and it's good for his character.

Option D) Go to a strategy in which all BTLs are tried for sendi before any of them is tried for a send. The code would look like this:

  for ( btl = ... ) if ( SUCCESS == btl->sendi() ) return SUCCESS;
  for ( btl = ... ) if ( SUCCESS == btl->send() ) return SUCCESS;
  return ERROR;

The reason this is so much easier to achieve is that we can put that first loop way up high in the PML (as soon as a send enters the PML, avoiding all that expensive overhead) and leave the second loop several layers down, where it is today. George is against this new loop structure because he thinks round-robin selection of BTLs is most fair and distributes the load over BTLs as evenly as possible. (In contrast, the proposed loop would favor BTLs with sendi functions.) It seems to me, however, that favoring BTLs that have sendi functions is exactly the right thing to do! I'm not even convinced that the conditions he's worried about are that common: multiple eager BTLs to poll, one has a sendi, and that sendi is not very good or that BTL is getting overloaded.

Anyhow, I like Option D, but George does not.

Option E) Go to a strategy in which only the next BTL in the round-robin order is tested for a sendi function. If there is one, use it. If not, just continue with the usual heavyweight PML procedure. This feels a little hackish to me, but it'll mean that most of the time that sendi can be called, the heavyweight PML overhead will be avoided, while at the same time "fair" round-robin polling over the BTLs is maintained.
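
A rough sketch of what Option E might look like, under stated assumptions: btl_t, btl_array_t, pml_send(), heavy_send() and the stub_* functions are hypothetical names, not the real PML or BTL interfaces, and heavy_setups is just a counter standing in for the cost of building the send request. How the cursor should advance when sendi fails is also an assumption here (it is left alone so the same BTL gets its send() tried first).

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*send_fn_t)(const void *buf, size_t len);

typedef struct {
    send_fn_t sendi;    /* NULL if this BTL has no send-immediate */
    send_fn_t send;
} btl_t;

typedef struct {
    btl_t *btls;
    size_t count;
    size_t next;        /* round-robin cursor */
} btl_array_t;

typedef enum { SENT_FAST, SENT_SLOW, SEND_FAILED } send_path_t;

static int heavy_setups = 0;   /* how often the heavyweight path was entered */

/* Stand-in for the existing path: set up the request, then round-robin. */
static send_path_t heavy_send(btl_array_t *arr, const void *buf, size_t len)
{
    heavy_setups++;            /* pretend to initialize the send request */
    for (size_t i = 0; i < arr->count; i++) {
        btl_t *btl = &arr->btls[arr->next];
        arr->next = (arr->next + 1) % arr->count;
        if (btl->send(buf, len)) return SENT_SLOW;
    }
    return SEND_FAILED;
}

/* Option E: peek at the next BTL only. If it has a sendi and the sendi
 * succeeds, skip the heavyweight setup entirely; otherwise fall through. */
static send_path_t pml_send(btl_array_t *arr, const void *buf, size_t len)
{
    if (arr->count == 0) return SEND_FAILED;

    btl_t *btl = &arr->btls[arr->next];
    if (btl->sendi != NULL && btl->sendi(buf, len)) {
        arr->next = (arr->next + 1) % arr->count;   /* keep the rotation fair */
        return SENT_FAST;
    }
    return heavy_send(arr, buf, len);
}

/* Demo stubs. */
static bool stub_ok(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return true;
}

static bool stub_fail(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return false;
}
```

In this sketch the fast path costs one NULL check plus the sendi call itself, and the round-robin cursor still advances one BTL per message either way.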

I'll proceed with Option C for the time being. If I don't announce success or surrender in the next few days, please write to me at the insane asylum.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
Cisco Systems
