George Bosilca wrote:

Here is another way to write the code without having to pay for the expensive initialization of sendreq.
   first_time = 0;
   for ( btl = ... ) {
       if ( SUCCESS == sendi() ) return SUCCESS;
       if( 0 == first_time++)  set_up_expensive_send_request(&sendreq);
       if ( SUCCESS == send(&sendreq) ) return SUCCESS;
   }

Sure. Well, things are complicated by the fact that "set_up_expensive_send_request()" is not a factored-out function. So, restructuring code to look like this is a hassle. But, let's first figure out what we *want* to do and then tackle what is merely a simple matter of implementation! :^)
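For concreteness, here is a minimal, self-contained sketch of that loop shape in C. Everything in it is a hypothetical stand-in rather than the actual PML/BTL code: the sketch_* types, the stub send functions, and the setup counter exist only to show that the request setup runs at most once, and only after some sendi attempt has failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Hypothetical stand-in for the expensive send request. */
typedef struct sketch_request {
    const void *buf;
    size_t len;
    bool ready;              /* true once the setup has run */
} sketch_request_t;

/* Counts how often the expensive setup runs (for illustration only). */
int sketch_setup_calls = 0;

void sketch_setup_request(sketch_request_t *req, const void *buf, size_t len)
{
    assert(req != NULL);
    req->buf = buf;
    req->len = len;
    req->ready = true;
    sketch_setup_calls++;
}

/* Hypothetical stand-in for a BTL; sendi is NULL if the BTL lacks it. */
typedef struct sketch_btl {
    int (*sendi)(const void *buf, size_t len);
    int (*send)(const sketch_request_t *req);
} sketch_btl_t;

/* Try sendi on each BTL first; build the request lazily, at most once. */
int sketch_send_small(const sketch_btl_t *btls, size_t nbtls,
                      const void *buf, size_t len)
{
    sketch_request_t req = { NULL, 0, false };

    for (size_t i = 0; i < nbtls; i++) {
        if (btls[i].sendi != NULL &&
            SKETCH_SUCCESS == btls[i].sendi(buf, len)) {
            return SKETCH_SUCCESS;          /* cheap path, no request built */
        }
        if (!req.ready) {
            sketch_setup_request(&req, buf, len);
        }
        if (btls[i].send != NULL &&
            SKETCH_SUCCESS == btls[i].send(&req)) {
            return SKETCH_SUCCESS;
        }
    }
    return SKETCH_ERROR;
}

/* Stub transports used only to exercise the sketch. */
int sketch_sendi_ok(const void *buf, size_t len)   { (void)buf; (void)len; return SKETCH_SUCCESS; }
int sketch_sendi_fail(const void *buf, size_t len) { (void)buf; (void)len; return SKETCH_ERROR; }
int sketch_send_ok(const sketch_request_t *req)    { (void)req; return SKETCH_SUCCESS; }
int sketch_send_fail(const sketch_request_t *req)  { (void)req; return SKETCH_ERROR; }
```

With a single BTL whose sendi succeeds, sketch_setup_calls stays at 0. With a first BTL that fails both paths and a second BTL that has no sendi, the setup runs exactly once before the second BTL's send is tried.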

Anyway, the main problem is not in this code. The main problem is that now, instead of sharing the load over all available BTLs in a round-robin fashion, you overload the BTL(s) providing the sendi function with small (and eager) messages, and you completely ignore all the others until something goes wrong.

However, I can see one interesting point in your approach. As the BTLs are indexed in increasing order of their published latency in the eager array, we might benefit from the smallest latency for several small messages before taking the most expensive path. But this is not something we should tackle lightly, as it modifies the most performance-related parts of the PML.

I would like to understand this better. Let's say you can reach your destination via two BTLs: sm and TCP. I don't know what the numbers are, but let's say TCP latency is >10x the sm latency. Are you saying we want to round-robin between the two BTLs? And to do otherwise would modify a lot of the PML? Like what?

I can imagine cases where one might have comparable BTLs and want to round-robin between them. But, if one BTL is much faster than another, I would want to use the faster one. Period. Especially if it had a sendi function.
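To make the policy I have in mind concrete, here is a rough sketch in C. It assumes the BTL array is already sorted by increasing latency, and it round-robins only among the leading BTLs whose latency is within some factor of the fastest one. The policy_* names and the factor parameter are hypothetical, nothing here is taken from the PML, and any latency values passed in are placeholders rather than measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a BTL for selection purposes. */
typedef struct policy_btl {
    const char *name;
    double latency_us;       /* published latency; values are illustrative */
} policy_btl_t;

/*
 * Count the leading BTLs whose latency is within `factor` times the
 * latency of the fastest one. Assumes btls[] is sorted by increasing
 * latency, so btls[0] is the fastest.
 */
size_t policy_comparable_count(const policy_btl_t *btls, size_t n,
                               double factor)
{
    assert(btls != NULL || n == 0);
    if (n == 0) {
        return 0;
    }
    size_t count = 1;
    while (count < n &&
           btls[count].latency_us <= factor * btls[0].latency_us) {
        count++;
    }
    return count;
}

/*
 * Pick the BTL index for the msg_index-th small message: round-robin
 * over the comparable BTLs only, ignoring the much slower ones.
 */
size_t policy_pick(const policy_btl_t *btls, size_t n, double factor,
                   size_t msg_index)
{
    size_t k = policy_comparable_count(btls, n, factor);
    assert(k > 0);           /* caller must supply at least one BTL */
    return msg_index % k;
}
```

With an sm entry at 1.0 and a TCP entry at 15.0 (made-up values) and a factor of 2.0, every message goes to index 0. With two entries at 1.0 and 1.5, the messages alternate between index 0 and index 1.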
