Re: [OMPI devel] Device failover on ob1

Ralph Castain Sun, 2 Aug 2009 18:20:09 -0400

The objections being cited are somewhat unfair - perhaps people do not understand the proposal being made? The developers have gone out of their way to ensure that all changes are configured out unless you specifically select to use that functionality. This has been our policy from day one - as long as the changes have zero impact unless the user specifically requests that it be used, then no harm is done.

So I personally don't see any objection to bringing it into the code base. Latency is not impacted one bit -unless- someone deliberately configures the code to use this feature. In that case, they are deliberately accepting any impact in order to gain the benefits.

Perhaps a bigger question needs to be addressed - namely, does the ob1 code need to be refactored?

Having been involved a little in the early discussion with bull when we debated over where to put this, I know the primary concern was that the code not suffer the same fate as the dr module. We have since run into a similar issue with the checksum module, so I know where they are coming from.

The problem is that the code base is adjusted to support changes in ob1, which is still being debugged. On the order of 95% of the code in ob1 is required to be common across all the pml modules, so the rest of us have to (a) watch carefully all the commits to see if someone touches ob1, and then (b) manually mirror the change in our modules.

This is not a supportable model over the long-term, which is why dr has died, and checksum is considering integrating into ob1 using configure #if's to avoid impacting non-checksum users. Likewise, device failover has been treated similarly here - i.e., configure out the added code unless someone wants it.

This -does- lead to messier source code with these #if's in it. If we can refactor the ob1 code so the common functionality resides in the base, then perhaps we can avoid this problem.
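
As a rough illustration of the "configure out" approach discussed above, the sketch below guards a failover hook with a preprocessor test, so a build without the feature contains none of the added code. The macro CFG_ENABLE_FAILOVER and every function and type name here are hypothetical; none of them are taken from the Open MPI tree or from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro: a real build would have configure define it. */
#ifndef CFG_ENABLE_FAILOVER
#define CFG_ENABLE_FAILOVER 0
#endif

struct cfg_frag {
    size_t len;      /* payload length in bytes */
    int    tracked;  /* set to 1 when the failover layer records the fragment */
};

#if CFG_ENABLE_FAILOVER
/* Only compiled when the feature is requested at configure time. */
static void cfg_track(struct cfg_frag *frag)
{
    frag->tracked = 1;
}
#endif

/* Send path: the tracking call exists only when compiled in. */
static int cfg_send(struct cfg_frag *frag)
{
    assert(frag != NULL);
#if CFG_ENABLE_FAILOVER
    cfg_track(frag);
#endif
    return 0;
}
```

With the macro left at 0, cfg_send() compiles to the plain path, which is the zero-impact property; the cost is the scattered #if blocks mentioned above.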


Is it possible?
Ralph

On Aug 2, 2009, at 3:25 PM, Graham, Richard L. wrote:




On 8/2/09 12:55 AM, "Brian Barrett" <[email protected]> wrote:

While I agree that performance impact (latency in this case) is
important, I disagree that this necessarily belongs somewhere other
than ob1.  For example, a zero-performance impact solution would be to
provide two versions of all the interface functions, one with failover
turned on and one with it turned off, and select the appropriate
functions at initialization time.  There are others, including careful
placement of decision logic, which are likely to result in near-zero
impact.  I'm not attempting to prescribe a solution, but refuting the
claim that this can't be in ob1 - I think more data is needed before
such a claim is made.

Another way to handle this is to set the function pointers.
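
The idea of keeping two versions of the interface functions and choosing between them once at initialization could be sketched as follows. This is only an illustration under assumptions: the table layout, the sel_ names and the counter are hypothetical and do not reflect the actual ob1 PML structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interface table; not the actual ob1 PML structure. */
typedef int (*sel_send_fn)(const void *buf, size_t len);

struct sel_pml {
    sel_send_fn send;
};

/* Counts sends that went through the failover-aware variant. */
static int sel_tracked_sends = 0;

/* Plain variant: no failover bookkeeping at all. */
static int sel_send_plain(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 0;
}

/* Failover variant: does its bookkeeping, then reuses the plain path. */
static int sel_send_failover(const void *buf, size_t len)
{
    sel_tracked_sends++;
    return sel_send_plain(buf, len);
}

/* The decision is made once here, so the send path itself has no branch. */
static void sel_pml_init(struct sel_pml *pml, int failover_enabled)
{
    assert(pml != NULL);
    pml->send = failover_enabled ? sel_send_failover : sel_send_plain;
}
```

After sel_pml_init(&pml, 0), calls through pml.send reach the plain function directly; the failover variant is only paid for when it was selected.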


Mouhamed - can the openib btl try to re-establish a connection between
two peers today (with your ob1 patches, obviously)?  Would this allow
us to adapt to changing routes due to switch failures (assuming that
there are other physical routes around the failed switch, of course)?

The big question is what are the assumptions that are being made
for this mode of failure recovery. If the assumption is that local
completion implies remote delivery, the problem is simple to solve.
If not, heavier weight protocols need to be used to cover the range
of ways failure may manifest itself.


Rich

Thanks,

Brian

On Aug 1, 2009, at 6:21 PM, Graham, Richard L. wrote:

What is the impact on sm, which is by far the most sensitive to
latency?  This really belongs in a place other than ob1.  Ob1 is
supposed to provide the lowest latency possible, and other pml's are
supposed to be used for heavier weight protocols.

On the technical side, how do you distinguish between a lost
acknowledgement and an undelivered message ?  You really don't want
to try and deliver data into user space twice, as once a receive is
complete, who knows what the user has done with that buffer ?  A
general treatment needs to be able to handle false negatives, and
attempts to deliver the data more than once.

How are you detecting missing acknowledgements ?  Are you using some
sort of timer ?

Rich

On 7/31/09 5:49 AM, "Mouhamed Gueye" <[email protected]> wrote:

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work on ob1 rather than dr
and we now have a working prototype on top of ob1. The approach is to
store btl descriptors sent to peers and delete them when we receive
proof of delivery. So far, we rely on completion callback functions,
assuming that the message is delivered when the completion function is
called, which is the case for openib. When a btl module fails, it is
removed from the endpoint's btl list and the next one is used to
retransmit stored descriptors. No extra message is transmitted; it
only consists of additions to the header. It has been mainly tested
with two IB modules, in both multi-rail (two separate networks) and
multi-path (a single large network).
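
The store-and-retransmit scheme described above can be sketched roughly as follows. This is a simplified model under assumptions, not code from the patch: descriptors and BTLs are reduced to integer ids, the store is a small fixed array, and all fo_ names and sizes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FO_MAX_BTLS    4
#define FO_MAX_PENDING 8

struct fo_desc {
    int id;      /* hypothetical descriptor identifier */
    int btl;     /* BTL the descriptor was last sent on */
    int in_use;  /* 1 while delivery has not been confirmed */
};

struct fo_endpoint {
    int btls[FO_MAX_BTLS];  /* usable BTL ids, in order of preference */
    int num_btls;
    struct fo_desc pending[FO_MAX_PENDING];
};

/* Send on the preferred BTL and store the descriptor.
 * Returns the BTL used, or -1 if no BTL or no free slot is available. */
static int fo_send(struct fo_endpoint *ep, int id)
{
    assert(ep != NULL);
    if (ep->num_btls == 0) {
        return -1;
    }
    for (int i = 0; i < FO_MAX_PENDING; i++) {
        if (!ep->pending[i].in_use) {
            ep->pending[i] = (struct fo_desc){ id, ep->btls[0], 1 };
            return ep->btls[0];
        }
    }
    return -1;
}

/* Completion callback: treated here as proof of delivery. */
static void fo_complete(struct fo_endpoint *ep, int id)
{
    for (int i = 0; i < FO_MAX_PENDING; i++) {
        if (ep->pending[i].in_use && ep->pending[i].id == id) {
            ep->pending[i].in_use = 0;
        }
    }
}

/* BTL failure: remove it from the endpoint's list and move its stored
 * descriptors to the next BTL. Returns how many were retransmitted. */
static int fo_btl_failed(struct fo_endpoint *ep, int btl)
{
    int kept = 0;
    for (int i = 0; i < ep->num_btls; i++) {
        if (ep->btls[i] != btl) {
            ep->btls[kept++] = ep->btls[i];
        }
    }
    ep->num_btls = kept;

    int resent = 0;
    for (int i = 0; i < FO_MAX_PENDING; i++) {
        struct fo_desc *d = &ep->pending[i];
        if (d->in_use && d->btl == btl && ep->num_btls > 0) {
            d->btl = ep->btls[0];
            resent++;
        }
    }
    return resent;
}
```

Only descriptors whose completion callback has not yet fired are retransmitted, which is why the scheme depends on completion implying delivery.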

You can grab and test the patch here (applies on top of the trunk) :
http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just pass --enable-device-failover
to configure. You can then run a benchmark, disconnect a port and see
the failover operate.

A small latency increase (~ 2%) is induced by the failover layer when
no failover occurs. To accelerate the failover process on openib, you
can try lowering the btl_openib_ib_timeout openib parameter, for
example to 15 instead of 20 (the default value).
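
To give a sense of what lowering the parameter means, the openib BTL documents btl_openib_ib_timeout as an exponent in the formula 4.096 microseconds * 2^value. The small helper below (a hypothetical name, written only for this illustration) evaluates that formula.

```c
#include <assert.h>

/* InfiniBand transmit timeout in seconds for a given exponent:
 * 4.096 microseconds * 2^exponent. */
static double ib_timeout_seconds(unsigned exponent)
{
    assert(exponent < 32);
    return 4.096e-6 * (double)(1UL << exponent);
}
```

Under that formula a value of 20 corresponds to about 4.3 seconds per attempt and 15 to about 0.13 seconds. The adapter usually retries before reporting an error, so the time before failover starts is likely to be a multiple of these figures.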

Mouhamed
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
  Brian Barrett
  Open MPI developer
  http://www.open-mpi.org/

