Actually, there may be a more important issue here.
Currently, the PML chooses the BTL first. Once the BTL choice is
established, only then does the PML choose between sendi and send.
On top of that, we're spending a lot of time in the PML doing a
bunch of stuff that's totally unnecessary if the sendi succeeds.
So, we're neutralizing much of the advantage sendi is
supposed to provide.
So, I'm changing the PML to invoke sendi much sooner. The way I'm doing
this is to loop over BTLs, looking for a sendi that exists and
succeeds. If I find one, I'm done. If I don't, I have to go with the
standard send code path.
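Here is a rough C sketch of the loop I have in mind. Everything in it is hypothetical: toy_btl_t, the toy_* callbacks, and the 0/-1 return convention are simplified stand-ins I made up for illustration, not the real BTL module types or return codes. The fallback picks the first BTL only as a placeholder for the existing selection logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a BTL module (not the real type). */
typedef struct toy_btl {
    /* Optional fast path; NULL if this BTL has none. Returns 0 on success. */
    int (*sendi)(struct toy_btl *btl, const void *buf, size_t len);
    /* Standard path; always present. Returns 0 on success. */
    int (*send)(struct toy_btl *btl, const void *buf, size_t len);
    int sendi_calls;   /* bookkeeping for this example only */
    int send_calls;
} toy_btl_t;

/* Example callbacks with assumed behavior. */
static int toy_sendi_ok(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)buf; (void)len;
    btl->sendi_calls++;
    return 0;                      /* message went out immediately */
}

static int toy_sendi_busy(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)buf; (void)len;
    btl->sendi_calls++;
    return -1;                     /* fails and leaves no side effects */
}

static int toy_send_ok(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)buf; (void)len;
    btl->send_calls++;
    return 0;
}

/* Return the index of the first BTL whose sendi exists and succeeds, or -1. */
static int toy_try_sendi(toy_btl_t *btls, size_t n, const void *buf, size_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (btls[i].sendi != NULL && btls[i].sendi(&btls[i], buf, len) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Try sendi early; otherwise take the standard path.
 * Returns the index of the BTL that carried the message, or -1 on error. */
static int toy_pml_send(toy_btl_t *btls, size_t n, const void *buf, size_t len)
{
    assert(n > 0);
    int idx = toy_try_sendi(btls, n, buf, len);
    if (idx >= 0) {
        return idx;                /* done: the rest of the PML work is skipped */
    }
    /* Standard path. The BTL used here need not be one whose sendi failed;
     * btls[0] is a placeholder for the real selection logic. */
    return (btls[0].send(&btls[0], buf, len) == 0) ? 0 : -1;
}
```

With three BTLs where the first has no sendi, the second's sendi fails, and the third's succeeds, toy_pml_send() returns index 2 and never touches a send function.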
The logic, as I just described it, allows multiple sendi functions
to fail, and the send that is ultimately used might belong to a
different BTL than any of the failing sendi's. This suggests
that I do NOT want a failing sendi to leave any side effects (like
allocated descriptors).
Is my proposed logic bad? Should I implement things another way? E.g.,
if I find a sendi function, use that BTL even if the sendi failed and
another BTL might have a sendi that could succeed? Or, does my proposed
change provide the justification for my pulling descriptor allocations
out of the sendi functions?
Further comments (of less importance) below:
George Bosilca wrote:
On Feb 23, 2009, at 12:14, Eugene Loh wrote:
George Bosilca wrote:
It doesn't sound reasonable to me. There is a reason for this, and
I think it's a good reason. The sendi function works for some
devices as a fast path for sending data, when the network is not
flooded. However, in the case where sendi cannot do the job we expect,
the fact that it returns the descriptor saves us a call (we don't
have to do the alloc call later).
This does not make any sense to me. In what sense are we "saving a
call"? Not in the sense of run-time performance since the BTL is
now having to allocate a descriptor it did not otherwise need. The
amount of work is the same (one descriptor allocation either way),
but you're just pushing that work into the BTLs.
The descriptor is a BTL resource. If the sendi doesn't return one,
the PML will have to call the BTL alloc function from the BTL again
(in this case the calls will look like this: btl_sendi followed by
btl_alloc followed by btl_send). I'm not looking only at SM, I want
all of the BTLs to have the opportunity to get good performance.
If sendi returns a descriptor when it fails to send the data, the call
list will be shorter: btl_sendi followed by btl_send. I'm trying to
decrease the number of jumps between the layers (PML/BTL), not the
number of lines of code in the BTL.
I think architectural streamlining -- even if just a little bit -- is a
good thing. And, in this particular case, replicating code into each
BTL sendi function just doesn't buy us anything. When the PML allocates
the descriptor, it simply calls mca_bml_base_alloc(), which is an
*inlined* function that immediately calls the BTL alloc function. No
big deal.
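To make the two call sequences concrete, here is a toy C model that counts calls from the PML into the BTL and descriptor allocations for each variant. All names (toy_link_t, toy_desc_t, the toy_* functions) are invented for this sketch and the signatures are assumptions; toy_bml_alloc() only mimics the idea of a thin inline wrapper around the BTL alloc function and is not the real mca_bml_base_alloc().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor and BTL stand-ins (not the real types). */
typedef struct toy_desc { int in_use; } toy_desc_t;

typedef struct toy_link {
    toy_desc_t pool;   /* a one-descriptor "pool" */
    int crossings;     /* non-inlined calls from the PML into the BTL */
    int allocs;        /* descriptor allocations performed */
} toy_link_t;

static toy_desc_t *toy_btl_alloc(toy_link_t *btl)
{
    btl->crossings++;
    btl->allocs++;
    btl->pool.in_use = 1;
    return &btl->pool;
}

/* Thin inline wrapper on the PML side: adds no extra layer crossing. */
static inline toy_desc_t *toy_bml_alloc(toy_link_t *btl)
{
    return toy_btl_alloc(btl);
}

static int toy_btl_send(toy_link_t *btl, toy_desc_t *desc)
{
    assert(desc != NULL);
    btl->crossings++;
    desc->in_use = 0;              /* descriptor consumed by the send */
    return 0;
}

/* Variant A: a failing sendi leaves no side effects. */
static int toy_sendi_plain(toy_link_t *btl)
{
    btl->crossings++;
    return -1;                     /* assume the fast path was unavailable */
}

/* Variant B: a failing sendi allocates a descriptor and hands it back. */
static int toy_sendi_with_desc(toy_link_t *btl, toy_desc_t **desc)
{
    btl->crossings++;
    btl->allocs++;
    btl->pool.in_use = 1;
    *desc = &btl->pool;
    return -1;
}

/* btl_sendi, then btl_alloc, then btl_send. */
static void toy_path_a(toy_link_t *btl)
{
    if (toy_sendi_plain(btl) != 0) {
        toy_desc_t *desc = toy_bml_alloc(btl);
        toy_btl_send(btl, desc);
    }
}

/* btl_sendi, then btl_send. */
static void toy_path_b(toy_link_t *btl)
{
    toy_desc_t *desc = NULL;
    if (toy_sendi_with_desc(btl, &desc) != 0) {
        toy_btl_send(btl, desc);
    }
}
```

In this model variant A makes three calls into the BTL and variant B makes two, while both perform exactly one descriptor allocation. The difference shows up only on the path where a sendi exists and fails.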
Further, having the sendi allocate the descriptor only makes a
difference when the BTL has provided a sendi function *AND* when that
function failed. That's an edge case. It's much more likely that the
BTL doesn't have a sendi function (e.g., openib) *OR* that function sent
the message successfully.
I could try comparing performance, but that's a lot of work just to
measure "noise".