I can't confirm or deny. The only thing I can tell is that the same
test works fine over other BTLs, so this tends to point either to a
problem in the sm BTL or to a particular path in the PML (the one used
by the sm BTL). I'll have to dig a little bit more into it, but I was
hoping to do it in the context of the new sm BTL (just to avoid having
to do it twice).
george.
On Feb 13, 2009, at 08:05 , Jeff Squyres wrote:
George -- can you confirm/deny? Is this something we need to fix
for v1.3.1?
On Feb 12, 2009, at 10:15 PM, Eugene Loh wrote:
Got it, thanks.
Is anyone else looking at that ticket? I'm still a newbie and I
suspect someone else could figure this problem out a lot faster
than I could. So, I'm curious how much I should be looking at this
ticket.
If amateurs are allowed to speculate, however, my guess is that
this isn't really a BTL thing. It reminds me of trac ticket 1468
(aka 1516). In that case, there was a lot of one-way traffic. We
needed a way to return frags to the sender. I guess that was solved.
So, the present problem is something different. My guess is that
senders are overrunning receivers. Could it be that some receiver
(like the root in the MPI_Reduce) ends up with too many incoming
messages? It has to queue up unexpected messages, which slows it
down further, which means it has to deal with even more unexpected
messages, etc. Those messages have to be placed somewhere, which
means memory is allocated, etc.
Just a theory. I don't know the PML well enough to judge its
soundness.
But if this is the case, it's a PML issue rather than a BTL issue.
Maybe there should be some flow control, particularly in our
implementation of collectives!
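To make the theory above a bit more concrete, here is a toy simulation
in Python. It is only a sketch and not Open MPI code: the function
simulate(), its parameters, and the credit scheme are all hypothetical,
made up for illustration. It models several senders pushing messages at
one receiver that drains more slowly, and compares the peak length of
the receiver's unexpected-message queue with and without a simple
per-sender credit limit.

```python
from collections import deque


def simulate(num_senders, steps, recv_rate, credits=None):
    """Return the peak length of the receiver's unexpected-message queue.

    Toy model (all names hypothetical): on every step each sender tries
    to send one message, then the receiver consumes at most `recv_rate`
    messages. With credits=None senders are never throttled. Otherwise a
    sender may have at most `credits` messages sent but not yet consumed.
    """
    queue = deque()                   # unexpected messages, stored as sender ids
    outstanding = [0] * num_senders   # unconsumed messages per sender
    peak = 0
    for _ in range(steps):
        # Senders push as long as flow control (if any) allows it.
        for sender in range(num_senders):
            if credits is None or outstanding[sender] < credits:
                queue.append(sender)
                outstanding[sender] += 1
        peak = max(peak, len(queue))
        # The receiver drains a bounded number of messages per step,
        # which hands a credit back to each message's sender.
        for _ in range(min(recv_rate, len(queue))):
            sender = queue.popleft()
            outstanding[sender] -= 1
    return peak


if __name__ == "__main__":
    print("no flow control:  ", simulate(8, 100, 2))
    print("2 credits/sender: ", simulate(8, 100, 2, credits=2))
```

In this model the unthrottled queue keeps growing for as long as the
senders keep sending, while with credits it can never exceed
num_senders * credits. Whether the real problem behaves this way is
exactly what I can't judge without knowing the PML better.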
Ralph Castain wrote:
The connection is only that, if you are going to modify the sm BTL
as you say, you might at least want to be aware that we have a
problem in it so you (a) don't make it worse than it already is,
and (b) might keep an eye open for the problem as you are
changing things.
On Feb 12, 2009, at 3:58 PM, Eugene Loh wrote:
Sorry, what's the connection? Are we talking about
https://svn.open-mpi.org/trac/ompi/ticket/1791? Are you simply
saying that if I'm doing some sm BTL work, I
should also look at 1791? I'm trying to figure out if there's
some more specific connection I'm missing.
Ralph Castain wrote:
You might want to look at ticket #1791 while you are doing this
- Brad added some valuable data earlier today.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems