I can't confirm or deny. The only thing I can tell you is that the same test works fine over the other BTLs, so this tends to pinpoint a problem either in the sm BTL or in a particular path in the PML (the one used by the sm BTL). I'll have to dig a little more into it, but I was hoping to do it in the context of the new sm BTL (just to avoid having to do it twice).

  george.

On Feb 13, 2009, at 08:05 , Jeff Squyres wrote:

George -- can you confirm/deny? Is this something we need to fix for v1.3.1?

On Feb 12, 2009, at 10:15 PM, Eugene Loh wrote:

Got it, thanks.

Is anyone else looking at that ticket? I'm still a newbie and I suspect someone else could figure this problem out a lot faster than I could. So, I'm curious how much I should be looking at this ticket.

If amateurs are allowed to speculate, however, my guess is that this isn't really a BTL thing. It reminds me of trac ticket 1468 (aka 1516). In that case, there was a lot of one-way traffic. We needed a way to return frags to the sender. I guess that was solved.

So, the present problem is something different. My guess is that senders are overrunning receivers. Could it be that some receiver (like the root in an MPI_Reduce) ends up with too many incoming messages? It has to queue up the unexpected messages, which slows it down further, which means it has to deal with even more unexpected messages, and so on. Those messages have to be placed somewhere, which means memory is allocated, and so on.
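To make that concrete, here is a hypothetical many-to-one stress test (not taken from the ticket; the binary name, message count, and the artificial slowdown are all made up) that produces the kind of overrun I mean. The mpirun lines in the comment just show how to flip between the sm and tcp BTLs to compare.

    /* Hypothetical many-to-one flood: every non-root rank pushes small
     * eager messages at rank 0 while rank 0 drains them slowly, so the
     * root queues more and more unexpected messages (and memory).
     *
     * Compare BTLs with, e.g.:
     *   mpirun -np 8 --mca btl self,sm  ./flood
     *   mpirun -np 8 --mca btl self,tcp ./flood
     */
    #include <mpi.h>
    #include <unistd.h>

    #define NMSG 100000

    int main(int argc, char **argv)
    {
        int rank, size, i;
        char buf[64] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* Drain slowly: by the time each receive is posted, many sends
             * from every peer have already arrived and been queued. */
            for (i = 0; i < NMSG * (size - 1); i++) {
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, MPI_ANY_SOURCE,
                         0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (i % 1000 == 0) usleep(1000);  /* simulate a slow root */
            }
        } else {
            /* Senders push small (eager) messages as fast as they can. */
            for (i = 0; i < NMSG; i++) {
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }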

Just a theory. I don't know the PML well enough to judge its soundness.

But if this is the case, it's a PML issue rather than a BTL issue. Maybe there should be some flow control -- particularly in our implementation of collectives!
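For illustration only, here is a minimal sketch of what flow control could look like at the MPI level: a fixed window of unacknowledged messages per sender, with the receiver acking every WINDOW messages so no peer can run more than WINDOW messages ahead. The function names and window size are invented, and I'm not claiming this maps onto the PML or collective internals -- it just shows the idea.

    #include <mpi.h>

    #define WINDOW 32   /* max unacknowledged messages per sender */

    /* Sender side: after every WINDOW sends, block until the receiver
     * acknowledges, so the receiver's unexpected queue stays bounded. */
    static void send_with_flow_control(const char *buf, int len, int dst,
                                       int nmsg, MPI_Comm comm)
    {
        int ack, i;
        for (i = 0; i < nmsg; i++) {
            MPI_Send(buf, len, MPI_CHAR, dst, 0, comm);
            if ((i + 1) % WINDOW == 0) {
                MPI_Recv(&ack, 1, MPI_INT, dst, 1, comm, MPI_STATUS_IGNORE);
            }
        }
    }

    /* Receiver side: send the acknowledgment back every WINDOW messages. */
    static void recv_with_flow_control(char *buf, int len, int src,
                                       int nmsg, MPI_Comm comm)
    {
        int ack = 0, i;
        for (i = 0; i < nmsg; i++) {
            MPI_Recv(buf, len, MPI_CHAR, src, 0, comm, MPI_STATUS_IGNORE);
            if ((i + 1) % WINDOW == 0) {
                MPI_Send(&ack, 1, MPI_INT, src, 1, comm);
            }
        }
    }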

Ralph Castain wrote:

The connection is only that, if you are going to modify the sm BTL as you say, you might at least want to be aware that we have a problem in it so you (a) don't make it worse than it already is, and (b) might keep an eye open for the problem as you are changing things.

On Feb 12, 2009, at 3:58 PM, Eugene Loh wrote:

Sorry, what's the connection? Are we talking about https://svn.open-mpi.org/trac/ompi/ticket/1791 ? Are you simply saying that if I'm doing some sm BTL work, I should also look at 1791? I'm trying to figure out if there's some more specific connection I'm missing.

Ralph Castain wrote:

You might want to look at ticket #1791 while you are doing this - Brad added some valuable data earlier today.

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
Cisco Systems
