George Bosilca wrote:
Here is another way to write the code without having to pay for the
expensive initialization of sendreq.
first_time = 0;
for ( btl = ... ) {
    if ( SUCCESS == sendi() ) return SUCCESS;
    if ( 0 == first_time++ ) set_up_expensive_send_request(&sendreq);
    if ( SUCCESS == send(&sendreq) ) return SUCCESS;
}
Sure. Well, things are complicated by the fact that
"set_up_expensive_send_request()" is not a factored-out function. So,
restructuring code to look like this is a hassle. But, let's first
figure out what we *want* to do and then tackle what is merely a simple
matter of implementation! :^)
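
To make the shape of that restructuring concrete, here is a minimal,
self-contained sketch in C. Everything in it is a hypothetical stand-in
for illustration (btl_t, send_request_t, the stub send functions, and a
factored-out set_up_expensive_send_request()); none of it is the actual
PML or BTL interface. It only shows the pattern: try sendi first, and
build the send request at most once, on the first fallback.

```c
#include <stdbool.h>
#include <stddef.h>

enum { SUCCESS = 0, FAILURE = -1 };

/* Hypothetical request type; the real sendreq is far more involved. */
typedef struct { const void *buf; size_t len; } send_request_t;

/* Hypothetical BTL descriptor: sendi may be NULL if the BTL lacks one. */
typedef struct {
    int (*sendi)(const void *buf, size_t len);
    int (*send)(const send_request_t *req);
} btl_t;

static int setup_calls = 0;  /* counts how often the expensive setup runs */

static void set_up_expensive_send_request(send_request_t *req,
                                          const void *buf, size_t len)
{
    req->buf = buf;
    req->len = len;
    setup_calls++;
}

/* Stub transports used only to exercise the loop. */
static int sendi_ok(const void *b, size_t l)   { (void)b; (void)l; return SUCCESS; }
static int sendi_fail(const void *b, size_t l) { (void)b; (void)l; return FAILURE; }
static int send_ok(const send_request_t *r)    { (void)r; return SUCCESS; }
static int send_fail(const send_request_t *r)  { (void)r; return FAILURE; }

/* Try each BTL in order; initialize sendreq lazily, at most once. */
static int send_small(const btl_t *btls, size_t n_btls,
                      const void *buf, size_t len)
{
    send_request_t sendreq;
    bool have_req = false;

    for (size_t i = 0; i < n_btls; i++) {
        if (btls[i].sendi != NULL && SUCCESS == btls[i].sendi(buf, len))
            return SUCCESS;             /* fast path: no request built */
        if (!have_req) {
            set_up_expensive_send_request(&sendreq, buf, len);
            have_req = true;
        }
        if (SUCCESS == btls[i].send(&sendreq))
            return SUCCESS;
    }
    return FAILURE;
}
```

In this sketch, a message that goes out through sendi on the first BTL
never touches the setup routine, and a message that falls through
several BTLs still pays for the setup only once.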
Anyway, the main problem is not in this code. The main problem is
that now, instead of sharing the load over all available BTLs
in a round-robin fashion, you overload the BTL(s) providing the sendi
function with small (and eager) messages, and you completely ignore
all the others until something goes wrong.
However, I can see one interesting point in your approach. As the
BTLs are indexed in increasing order of their published latency in
the eager array, we might benefit from the smallest latency for
several small messages before taking the most expensive path. But
this is not something we should tackle lightly, as it modifies the
most performance-critical parts of the PML.
I would like to understand this better. Let's say you can reach your
destination via two BTLs: sm and TCP. I don't know what the numbers
are, but let's say TCP latency is more than 10x the sm latency. Are you
saying we want to round-robin between the two BTLs? And that doing
otherwise would modify a lot of the PML? Like what?
I can imagine cases where one might have comparable BTLs and want to
round-robin among them. But, if one BTL is much faster than another, I
would want to use the faster one. Period. Especially if it had a sendi
function.
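
One way to picture the two positions side by side is the small C
sketch below. It is only an illustration under stated assumptions: the
btl_info_t type, the latency values used to try it out, and the
"comparable" threshold factor are all made up, and this is not how the
PML currently schedules BTLs. It assumes the array is sorted by
increasing latency (as described for the eager array), that there is at
least one BTL, and that factor >= 1. It round-robins only among the
BTLs whose latency is within factor times the fastest one, so a lone
fast BTL gets all the small messages.

```c
#include <stddef.h>

/* Hypothetical per-BTL info; latencies are illustrative, not measured. */
typedef struct {
    const char *name;
    double latency_us;
} btl_info_t;

/* Count the leading BTLs whose latency is within `factor` times the
 * fastest one. Assumes btls[] is sorted by increasing latency. */
static size_t count_comparable(const btl_info_t *btls, size_t n_btls,
                               double factor)
{
    size_t k = 0;
    while (k < n_btls && btls[k].latency_us <= factor * btls[0].latency_us)
        k++;
    return k;
}

/* Pick the BTL index for the next small message: round-robin among the
 * comparable BTLs, which degenerates to "always the fastest" when no
 * other BTL is close. `next` carries the rotation state between calls. */
static size_t pick_btl(const btl_info_t *btls, size_t n_btls,
                       double factor, size_t *next)
{
    size_t k = count_comparable(btls, n_btls, factor);
    if (k == 0)
        return 0;                   /* defensive: empty or odd input */
    size_t chosen = *next % k;
    *next = (chosen + 1) % k;
    return chosen;
}
```

With sm at a hypothetical 1 us and TCP at 20 us and a factor of 2,
every call returns index 0 (sm). With two BTLs at 1 us and 1.5 us, the
calls alternate between index 0 and index 1.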