On Fri, Dec 14, 2007 at 06:53:55AM -0500, Richard Graham wrote:
> If you have positive confirmation that such things have happened, this will
> go a long way.
I instrumented the code to log all kinds of information about fragment
reordering while I was chasing a bug in openib that caused the matching logic
to malfunction. Any non-trivial application that uses the openib BTL will see
reordered fragments. (I wish this were not the case, but I don't have a
solution yet.)

>                 I will not trust the code until this has also been done with
> multiple independent network paths. 
I ran IMB over IP and IB simultaneously on more than 80 ranks.

>                                      I very rarely express such strong
> opinions, even if I don't agree with what is being done, but this is the
> core of correct MPI functionality, and first hand experience has shown that
I agree that this is indeed a very important piece of code, but it is
certainly not more important than, for instance, the data type engine (and it
is much easier to test all the corner cases in the matching logic than in the
data type engine, IMHO). And even if the matching code works perfectly, Open
MPI still will not work properly if other parts of OB1 are buggy, so why is
this particular code singled out as a sacred cow?

> just thinking through the logic, I can miss some of the race conditions.
That is of course correct, but the more people look at the code, the better,
isn't it?

> The code here has been running for 8+ years in two production MPIs running
> on very large clusters, so I am very reluctant to make changes for what
Are you sure about this? I see a number of changes to this code during Open
MPI development, and unfortunately the current SVN does not hold the full
history of this code. Here is the list of commits that I found; some of them
change the code's logic quite a bit:
r6770, r7342, r8339, r8352, r8353, r8356, r8946, r11874, r12323, r12582

> seems to amount to people's taste - maintenance is not an issue in this
> case.  Had this not been such a key bit of code, I would not even bat an
Why do you think that maintenance is not an issue? It is for me; otherwise I
wouldn't even have looked at this part of the code. All those macros rule out
the use of a debugger, for instance.

(And I see a small latency improvement too :))
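
As a small illustration of the debugger point (generic code, not the actual
OB1 macros), compare a macro with the equivalent static inline function: in
an optimized build they generate essentially the same code, but only the
function can be breakpointed and stepped into.

/* Illustration only -- not the actual OB1 macros. */
#include <stdio.h>

/* Macro version: expanded textually at every call site, so a debugger
 * has no symbol to break on and no frame to step into. */
#define SEQ_MATCHES(frag_seq, expected) ((frag_seq) == (expected))

/* Function version: a breakpoint can be set on it and it can be stepped
 * into at -O0, while the compiler is still free to inline it in
 * optimized builds. */
static inline int seq_matches(unsigned frag_seq, unsigned expected)
{
    return frag_seq == expected;
}

int main(void)
{
    printf("%d %d\n", SEQ_MATCHES(25, 25), seq_matches(26, 25));
    return 0;
}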

> eye.  I suppose if you can go through some formal verification, this would
> also be good - actually better than hoping that one will hit out-of-order
> situations.
> 
> Rich
> 
> 
> On 12/14/07 2:20 AM, "Gleb Natapov" <gl...@voltaire.com> wrote:
> 
> > On Thu, Dec 13, 2007 at 06:16:49PM -0500, Richard Graham wrote:
> >> The situation that needs to be triggered, just as George has mentioned, is
> >> one where we have a lot of unexpected messages, so that we can make sure
> >> that when a fragment we can match against comes in, all the unexpected
> >> messages that can be matched with pre-posted receives are matched.  Since
> >> we attempt to match only when a new fragment comes in, we need to make sure
> >> that we don't leave other matchable messages sitting in the unexpected
> >> queue, as these (if the out-of-order scenario is just right) would block
> >> any new matches from occurring.
> >> 
> >> For example:  Say the next expected message is 25
> >> 
> >> Unexpected message queue has:  26 28 29 ..
> >> 
> >> If 25 comes in and is handled, but 26 is not pulled off the unexpected
> >> message queue, then when 27 comes in it cannot be matched, as 26 is still
> >> sitting in the unexpected queue and will never be looked at again ...
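
To make the drain requirement concrete, here is a minimal standalone sketch
(my own illustration, not the actual OB1 code or data structures) of the
bookkeeping the example above calls for:

/* Minimal sketch, not the actual OB1 code: fragments carry a per-peer
 * sequence number; anything that arrives out of order is parked in a
 * pending list and drained as soon as the expected number shows up. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct pending_frag {
    uint16_t seq;
    struct pending_frag *next;
};

static uint16_t next_expected = 25;          /* next matchable sequence number */
static struct pending_frag *pending = NULL;  /* out-of-order fragments */

static void match_one(uint16_t seq)
{
    printf("matched fragment %u\n", (unsigned)seq);
}

/* Pull every now-matchable fragment out of the pending list.  Skipping
 * this drain is exactly the failure mode described above: 26 stays
 * parked and later blocks 27. */
static void drain_pending(void)
{
    int progress = 1;
    while (progress) {
        progress = 0;
        for (struct pending_frag **pp = &pending; *pp; pp = &(*pp)->next) {
            if ((*pp)->seq == next_expected) {
                struct pending_frag *hit = *pp;
                *pp = hit->next;
                match_one(hit->seq);
                free(hit);
                next_expected++;
                progress = 1;
                break;               /* rescan with the new expected number */
            }
        }
    }
}

static void frag_arrived(uint16_t seq)
{
    if (seq == next_expected) {
        match_one(seq);
        next_expected++;
        drain_pending();             /* 25 arriving must also release 26 */
    } else {
        struct pending_frag *f = malloc(sizeof(*f));
        f->seq = seq;
        f->next = pending;
        pending = f;                 /* park it until its turn comes */
    }
}

int main(void)
{
    /* Rich's example: 26, 28, 29 arrive early, then 25, then 27. */
    frag_arrived(26);
    frag_arrived(28);
    frag_arrived(29);
    frag_arrived(25);                /* must match 25 and then 26 */
    frag_arrived(27);                /* must match 27, 28 and 29 */
    return 0;
}

Dropping the drain_pending() call from frag_arrived() reproduces exactly the
stuck-at-26 scenario described above.
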
> > This situation is triggered constantly with the openib BTL. The openib BTL
> > has two ways to receive a packet: over a send queue or over an eager RDMA
> > path. The receiver polls both of them and may reorder packets locally.
> > Actually, there is currently a bug in the openib BTL where one channel may
> > starve the other at the receiver, so if the match fragment with the next
> > sequence number is in the starved path, tens of thousands of fragments can
> > be reordered. The test case attached to ticket #1158 triggers this case,
> > and my patch handles all reordered packets.
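
A toy model of why polling two in-order channels yields out-of-order delivery
to the matching layer (an illustration only, not the openib BTL code):

/* Toy model, not the openib BTL code: two receive channels that are each
 * in order by themselves (think send-queue completions vs. eager-RDMA
 * slots) are polled by one receiver.  Draining one channel ahead of the
 * other hands the fragments to the matching layer out of order. */
#include <stdio.h>

int main(void)
{
    int srq_path[]  = { 25, 27, 30 };   /* arrived via the send queue */
    int rdma_path[] = { 26, 28, 29 };   /* arrived via eager RDMA     */
    int i;

    /* A biased poll loop: drain the RDMA path completely first -- a
     * caricature of the starvation described above. */
    for (i = 0; i < 3; i++)
        printf("deliver %d\n", rdma_path[i]);
    for (i = 0; i < 3; i++)
        printf("deliver %d\n", srq_path[i]);
    /* The matching layer sees 26 28 29 25 27 30, i.e. reordered. */
    return 0;
}

Each path is in order by itself; it is the interleaving (or starvation) at
the receiver that the matching logic has to absorb.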
> > 
> > And, by the way, the code is much simpler now and can be reviewed easily ;)
> > 
> > --
> > Gleb.

--
                        Gleb.
