[OMPI devel] Device failover on ob1

Mouhamed Gueye Fri, 31 Jul 2009 05:50:07 -0400

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work toward ob1 rather than dr, and we now have a working prototype on top of ob1. The approach is to store the btl descriptors sent to peers and delete them when we receive proof of delivery. So far, we rely on completion callback functions, assuming that a message has been delivered once its completion function is called, which is the case for openib. When a btl module fails, it is removed from the endpoint's btl list and the next one is used to retransmit the stored descriptors. No extra message is transmitted; the scheme only adds fields to the header. It has mainly been tested with two IB modules, in both multi-rail (two separate networks) and multi-path (a single large network) configurations.
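To make the store-and-retransmit idea concrete, here is a minimal, standalone C sketch. It is not code from the patch: every type and function name (sketch_btl_t, sketch_desc_t, sketch_endpoint_t, ep_send, ep_complete, ep_btl_failed, ep_pending_count) is hypothetical, a counter stands in for the actual transmission, and marking a module as not alive stands in for removing it from the endpoint's btl list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BTLS    4
#define MAX_PENDING 16

/* Hypothetical stand-ins for illustration; these are not Open MPI types. */
typedef struct {
    int alive;   /* 0 once the device has failed */
    int sent;    /* descriptors handed to this module (stands in for a send) */
} sketch_btl_t;

typedef struct {
    unsigned seq;    /* sequence number, assumed to travel in the header */
    int btl_index;   /* module the descriptor was last sent on */
    int in_use;      /* slot holds a descriptor awaiting proof of delivery */
} sketch_desc_t;

typedef struct {
    sketch_btl_t  btls[MAX_BTLS];
    size_t        n_btls;
    sketch_desc_t pending[MAX_PENDING];
    unsigned      next_seq;
} sketch_endpoint_t;

/* Index of the first module that is still usable, or -1 if none is left. */
static int first_alive(const sketch_endpoint_t *ep)
{
    for (size_t i = 0; i < ep->n_btls; i++)
        if (ep->btls[i].alive)
            return (int)i;
    return -1;
}

void ep_init(sketch_endpoint_t *ep, size_t n_btls)
{
    assert(n_btls <= MAX_BTLS);
    memset(ep, 0, sizeof(*ep));
    ep->n_btls = n_btls;
    for (size_t i = 0; i < n_btls; i++)
        ep->btls[i].alive = 1;
}

/* "Send" on the first live module and keep the descriptor until completion.
 * Returns the sequence number, or -1 if nothing could be sent or stored. */
int ep_send(sketch_endpoint_t *ep)
{
    int b = first_alive(ep);
    if (b < 0)
        return -1;
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!ep->pending[i].in_use) {
            ep->pending[i].seq = ep->next_seq;
            ep->pending[i].btl_index = b;
            ep->pending[i].in_use = 1;
            ep->btls[b].sent++;
            return (int)ep->next_seq++;
        }
    }
    return -1;
}

/* Completion callback: treated as proof of delivery, so drop the descriptor. */
void ep_complete(sketch_endpoint_t *ep, unsigned seq)
{
    for (size_t i = 0; i < MAX_PENDING; i++)
        if (ep->pending[i].in_use && ep->pending[i].seq == seq)
            ep->pending[i].in_use = 0;
}

/* Module failure: retire it and resend its stored descriptors on the next
 * live module. Returns how many descriptors were retransmitted. */
int ep_btl_failed(sketch_endpoint_t *ep, int failed)
{
    int resent = 0;
    ep->btls[failed].alive = 0;
    int next = first_alive(ep);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        sketch_desc_t *d = &ep->pending[i];
        if (!d->in_use || d->btl_index != failed || next < 0)
            continue;   /* not affected, or no device left: stays stored */
        d->btl_index = next;
        ep->btls[next].sent++;
        resent++;
    }
    return resent;
}

int ep_pending_count(const sketch_endpoint_t *ep)
{
    int n = 0;
    for (size_t i = 0; i < MAX_PENDING; i++)
        n += ep->pending[i].in_use;
    return n;
}
```

With two modules, two sends followed by one completion leave a single stored descriptor; calling ep_btl_failed(&ep, 0) then moves that descriptor to the second module. The descriptor stays stored until its own completion arrives, so a second failure could trigger another retransmission.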


You can grab and test the patch here (it applies on top of the trunk):
http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just pass --enable-device-failover to configure. You can then run a benchmark, disconnect a port, and see the failover operate.

The failover layer induces a small latency increase (~2%) when no failover occurs. To speed up the failover process on openib, you can try lowering the btl_openib_ib_timeout parameter, for example to 15 instead of 20 (the default value).
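For a rough sense of what lowering the timeout buys, the InfiniBand local ACK timeout is defined as 4.096 microseconds times 2 raised to the configured value. The small C helper below (ib_ack_timeout_seconds is a hypothetical name, not an Open MPI function) evaluates that formula. It covers a single timeout period only; the time actually needed to detect a dead port also depends on the retry count and on the HCA, which this sketch does not model.

```c
#include <assert.h>

/* Per-attempt InfiniBand local ACK timeout in seconds:
 * 4.096 us * 2^value. The field is 5 bits wide; this helper
 * accepts 1..31 only. */
double ib_ack_timeout_seconds(unsigned value)
{
    assert(value >= 1 && value <= 31);
    return 4.096e-6 * (double)(1ULL << value);
}
```

With this formula, a value of 20 gives about 4.29 s per timeout period and a value of 15 gives about 0.134 s, a factor of 32 shorter.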


Mouhamed
