Actually, there may be a more important issue here.

Currently, the PML chooses the BTL first. Once the BTL choice is established, only then does the PML choose between sendi and send.

Currently, it's also the case that we're spending a lot of time in the PML doing a bunch of stuff that's totally unnecessary if the sendi succeeds. So, we're neutralizing much of the advantage sendi is supposed to provide.

So, I'm changing the PML to invoke sendi much sooner. The way I'm doing this is to loop over BTLs, looking for a sendi that exists and succeeds. If I find one, I'm done. If I don't, I have to go with the standard send code path.

The logic, as I just described it, allows that multiple sendi functions could fail and that the send that is ultimately used might belong to a different BTL than any of the BTLs whose sendi failed. This would suggest that I do NOT want failing sendi's leaving any side effects (like allocated descriptors).
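To make the proposal concrete, here is a rough sketch of the loop in C. It is not the actual PML code: every name in it (toy_btl_t, toy_pml_send, the TOY_* return codes, the call counters) is a hypothetical stand-in, and falling back to the first BTL's send is only an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes, not Open MPI's. */
enum { TOY_SUCCESS = 0, TOY_ERR_RESOURCE = -1 };

typedef struct toy_btl toy_btl_t;
typedef int (*toy_xfer_fn)(toy_btl_t *btl, const void *buf, size_t len);

/* Hypothetical stand-in for a BTL module. */
struct toy_btl {
    toy_xfer_fn sendi;   /* may be NULL: this BTL has no immediate send */
    toy_xfer_fn send;    /* standard send path, always present */
    int sendi_calls;     /* bookkeeping for the example only */
    int send_calls;
};

/* Sample sendi that succeeds. */
int toy_sendi_ok(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)buf; (void)len;
    btl->sendi_calls++;
    return TOY_SUCCESS;
}

/* Sample sendi that fails and leaves no side effects behind. */
int toy_sendi_fail(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)buf; (void)len;
    btl->sendi_calls++;
    return TOY_ERR_RESOURCE;
}

/* Sample standard send. */
int toy_send_ok(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)buf; (void)len;
    btl->send_calls++;
    return TOY_SUCCESS;
}

/* Loop over the BTLs looking for a sendi that exists and succeeds.
 * If none does, take the standard send path. Using btls[0] for the
 * fallback is an assumption of this sketch. */
int toy_pml_send(toy_btl_t **btls, size_t n,
                 const void *buf, size_t len, toy_btl_t **used)
{
    assert(btls != NULL && n > 0 && used != NULL);

    for (size_t i = 0; i < n; i++) {
        if (btls[i]->sendi != NULL &&
            btls[i]->sendi(btls[i], buf, len) == TOY_SUCCESS) {
            *used = btls[i];
            return TOY_SUCCESS;   /* done: nothing else needs to run */
        }
    }

    *used = btls[0];
    return btls[0]->send(btls[0], buf, len);
}
```

Note that the BTL that ends up doing the standard send need not be one whose sendi was tried and failed, which is why a failing sendi should not leave anything allocated.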

Is my proposed logic bad? Should I implement things another way? E.g., if I find a sendi function, use that BTL even if the sendi failed and another BTL might have a sendi that could succeed? Or, does my proposed change provide the justification for my pulling descriptor allocations out of the sendi functions?

Further comments (of less importance) below:

George Bosilca wrote:

On Feb 23, 2009, at 12:14, Eugene Loh wrote:

George Bosilca wrote:

It doesn't sound reasonable to me. There is a reason for this, and I think it's a good reason. The sendi function works for some devices as a fast path for sending data, when the network is not flooded. However, in the case sendi cannot do the job we expect, the fact that it returns the descriptor saves us a call (we don't have to do the alloc call later).

This does not make any sense to me. In what sense are we "saving a call"? Not in the sense of run-time performance since the BTL is now having to allocate a descriptor it did not otherwise need. The amount of work is the same (one descriptor allocation either way), but you're just pushing that work into the BTLs.

The descriptor is a BTL resource. If the sendi doesn't return one, the PML will have to call back into the BTL for the alloc function (in this case the calls will look like this: btl_sendi followed by btl_alloc followed by btl_send). I'm not looking only at SM, I want all of the BTLs to have the opportunity to get good performance.

If sendi returns a descriptor when it fails to send the data, the call list will be shorter: btl_sendi followed by btl_send. I'm trying to decrease the number of jumps between the layers (PML/BTL), not the number of lines of code in the BTL.
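The two call sequences being compared can be sketched roughly as follows. This is only a toy model that counts PML-to-BTL calls when sendi fails. None of these functions are the real BTL interface: toy_crossings, toy_desc_t, and the toy_* functions are all hypothetical, and both sendi variants are hard-wired to fail so that only the failure case is modeled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical descriptor; a real one carries much more state. */
typedef struct { int in_use; } toy_desc_t;

/* Counts calls from the PML into the BTL (example bookkeeping only). */
static int toy_crossings;

/* BTL alloc entry point, called from the PML. */
toy_desc_t *toy_btl_alloc(void)
{
    toy_crossings++;
    toy_desc_t *d = calloc(1, sizeof *d);
    if (d != NULL) d->in_use = 1;
    return d;
}

/* BTL standard send; consumes the descriptor. */
int toy_btl_send(toy_desc_t *d)
{
    toy_crossings++;
    assert(d != NULL);
    free(d);
    return 0;
}

/* Variant 1: a failing sendi leaves no side effects. */
int toy_sendi_plain(void)
{
    toy_crossings++;
    return -1;   /* always fails in this sketch */
}

/* Variant 2: a failing sendi hands back a descriptor it allocated
 * internally, so no separate alloc call from the PML is needed. */
int toy_sendi_with_desc(toy_desc_t **out)
{
    toy_crossings++;
    *out = calloc(1, sizeof **out);
    if (*out != NULL) (*out)->in_use = 1;
    return -1;   /* always fails in this sketch */
}

/* btl_sendi, then btl_alloc, then btl_send. Returns the call count. */
int toy_path_no_side_effects(void)
{
    toy_crossings = 0;
    if (toy_sendi_plain() == 0) return toy_crossings;
    toy_desc_t *d = toy_btl_alloc();
    if (d == NULL) return -1;
    toy_btl_send(d);
    return toy_crossings;
}

/* btl_sendi (returning a descriptor), then btl_send. */
int toy_path_sendi_returns_desc(void)
{
    toy_desc_t *d = NULL;
    toy_crossings = 0;
    if (toy_sendi_with_desc(&d) == 0) return toy_crossings;
    if (d == NULL) return -1;
    toy_btl_send(d);
    return toy_crossings;
}
```

In this model the first path makes three calls into the BTL and the second makes two. One descriptor is allocated either way, so the difference is only in where the allocation happens.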

I think architectural streamlining -- even if just a little bit -- is a good thing. And, in this particular case, replicating code into each BTL sendi function just doesn't buy us anything. When the PML allocates the descriptor, it simply calls mca_bml_base_alloc(), which is an *inlined* function that immediately calls the BTL alloc function. No big deal.

Further, having the sendi allocate the descriptor only makes a difference when the BTL has provided a sendi function *AND* when that function failed. That's an edge case. It's much more likely that the BTL doesn't have a sendi function (e.g., openib) *OR* that function sent the message successfully.

I could try comparing performance, but that's a lot of work just to measure "noise".
