Got it, thanks.
Is anyone else looking at that ticket? I'm still a newbie and I suspect
someone else could figure this problem out a lot faster than I could.
So, I'm curious how much I should be looking at this ticket.
If amateurs are allowed to speculate, however, my guess is that this
isn't really a BTL thing. It reminds me of trac ticket 1468 (aka
1516). In that case, there was a lot of one-way traffic. We needed a
way to return frags to the sender. I guess that was solved.
So, the present problem is something different. My guess is that
senders are overrunning receivers. Could it be that some receiver (like
the root in the MPI_Reduce) ends up with too many incoming messages?
It has to queue up unexpected messages, which slows it down further,
which means it has to deal with even more unexpected messages, and so on.
Those messages have to be placed somewhere, which means memory has to
be allocated for them.
Just a theory. I don't know the PML well enough to judge its soundness.
But if this is the case, it's a PML issue rather than a BTL issue.
Maybe there should be some flow control -- particularly in our
implementation of collectives!
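
To make the theory a bit more concrete, here is a toy model of it, not
anything taken from the PML or the sm BTL. It assumes several senders
each try to send one message per tick to a single receiver that can only
drain a fixed number of messages per tick. The function name, the
parameters, and the per-sender "credits" scheme are all made up for
illustration; the credits are just one possible form of flow control.
The model also leaves out the feedback where a long queue slows the
receiver down even more.

```python
from collections import deque


def simulate(n_senders, ticks, drain_per_tick, credits=None):
    """Return the peak length of the receiver's unexpected-message queue.

    credits=None means senders are never throttled. Otherwise each sender
    may have at most `credits` messages queued at the receiver at a time.
    This is a hypothetical model, not Open MPI's actual behavior.
    """
    queue = deque()               # messages waiting at the receiver
    in_flight = [0] * n_senders   # queued messages per sender
    peak = 0
    for _ in range(ticks):
        # Every sender tries to send one message per tick.
        for s in range(n_senders):
            if credits is None or in_flight[s] < credits:
                queue.append(s)
                in_flight[s] += 1
        peak = max(peak, len(queue))
        # The receiver drains a fixed number of messages per tick and
        # hands one credit back to the sender of each drained message.
        for _ in range(min(drain_per_tick, len(queue))):
            in_flight[queue.popleft()] -= 1
    return peak


unthrottled = simulate(n_senders=8, ticks=1000, drain_per_tick=4)
throttled = simulate(n_senders=8, ticks=1000, drain_per_tick=4, credits=2)
print("peak queue without flow control:", unthrottled)
print("peak queue with 2 credits per sender:", throttled)
```

Without credits the queue keeps growing for as long as the senders keep
sending, so memory use is unbounded. With credits it can never exceed
n_senders * credits, at the cost of senders sometimes having to wait.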
Ralph Castain wrote:
The connection is only that, if you are going to modify the sm BTL as
you say, you might at least want to be aware that we have a problem
in it so you (a) don't make it worse than it already is, and (b)
might keep an eye open for the problem as you are changing things.
On Feb 12, 2009, at 3:58 PM, Eugene Loh wrote:
Sorry, what's the connection? Are we talking about
https://svn.open-mpi.org/trac/ompi/ticket/1791 ? Are you simply
saying that if I'm doing some sm BTL work, I should also look at
1791? I'm trying to figure out if there's some more specific
connection I'm missing.
Ralph Castain wrote:
You might want to look at ticket #1791 while you are doing this -
Brad added some valuable data earlier today.