Re: [lustre-discuss] Network failover between IB and Ethernet

Olaf Weber Tue, 05 Jan 2016 04:59:00 -0800

On 30-12-15 03:22, Hans Henrik Happe wrote:

> Hi,
>
> I have a setup where servers and clients are all on both an IB and an
> Ethernet network. Now, if one of the networks is lost (dead switch), or a
> single path to a node fails, it would be nice if Lustre would fail over to
> the working network.

> Looking through old posts I only found some that are 7+ years old and they
> don't agree:
>
> http://lists.lustre.org/htdig.cgi/lustre-discuss-lustre.org/2008-April/001487.html
>
> http://lists.lustre.org/htdig.cgi/lustre-discuss-lustre.org/2008-July/002292.html


There appears to be a problem with the mailing list archives.

> Playing with 2.7, my conclusion is that it will not fail over like this. Is
> that correct?

Lustre doesn't support network failover, but there are some tricks you can
use that might work. Note that you can specify multiple NIDs for the server
nodes using the --servicenode or --failnode options to mkfs.lustre. If you
tell Lustre that these NIDs belong to the same node, then a peer connecting
to that node will select one of them to use. But if you tell Lustre that
these NIDs belong to different nodes, then a peer will switch between them.
So you'd set up a failover configuration where each node is its own backup.
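
To make the two ways of declaring NIDs concrete, here is a small Python
sketch that only builds mkfs.lustre argument lists and prints them; it does
not run anything. The file system name, NIDs, MGS NID and device path are
made-up placeholders. The convention assumed here (comma-separated NIDs in
one --servicenode option are taken as one node, repeated --servicenode
options as different nodes) is an assumption that should be checked against
the manual for your Lustre version.

```python
import shlex

# All values below are hypothetical placeholders, not taken from a real system.
FSNAME = "testfs"
MGS_NID = "10.10.0.1@o2ib"
OSS_NIDS = ["10.10.0.11@o2ib", "192.168.0.11@tcp"]  # IB and Ethernet NID of one OSS
DEVICE = "/dev/sdX"


def servicenode_args(nids, as_separate_nodes):
    """Return the --servicenode options for one target."""
    if as_separate_nodes:
        # One option per NID: each NID is declared as if it were a different
        # node, so a peer treats them as failover alternatives.
        return [f"--servicenode={nid}" for nid in nids]
    # One option listing all NIDs: they are declared as the same node,
    # so a peer simply selects one of them to use.
    return ["--servicenode=" + ",".join(nids)]


def mkfs_command(nids, as_separate_nodes, index=0):
    """Build (but do not run) an mkfs.lustre command line for an OST."""
    args = [
        "mkfs.lustre",
        "--ost",
        f"--fsname={FSNAME}",
        f"--index={index}",
        f"--mgsnode={MGS_NID}",
    ]
    args += servicenode_args(nids, as_separate_nodes)
    args.append(DEVICE)
    return args


if __name__ == "__main__":
    for separate in (False, True):
        label = "different nodes" if separate else "same node"
        print(f"# NIDs declared as {label}:")
        print(shlex.join(mkfs_command(OSS_NIDS, separate)))
```

Printing the command instead of executing it is deliberate: mkfs.lustre
formats the device, so the output is meant to be reviewed by hand first.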

Note that if you use --servicenode, then a peer will pick the NID to use
more-or-less at random. If the IB network is preferred, you'll have to
manually force a failover when the Ethernet gets chosen. For --failnode
there is a preference, but it has some other limitations -- see the manual.
In all cases a connection will be used until it fails: there is no automatic
failback.
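
The "used until it fails, no automatic failback" behaviour can be pictured
with a toy model. This is not Lustre code: the class, its method names and
the NIDs are all hypothetical, and starting on the first NID in the list is
a simplifying assumption (with --servicenode the initial choice is
more-or-less random).

```python
class PeerConnection:
    """Toy model of a peer's connection to one target (not Lustre code)."""

    def __init__(self, nids):
        self.nids = list(nids)
        self.current = 0  # assumption: start on the first NID in the list

    @property
    def active(self):
        """The NID currently in use."""
        return self.nids[self.current]

    def on_failure(self):
        """The active NID stopped working: move on to the next one."""
        self.current = (self.current + 1) % len(self.nids)
        return self.active

    def on_recovery(self, nid):
        """A previously failed NID is reachable again.

        Deliberately does nothing: the connection in use is kept until it
        fails, so there is no automatic failback.
        """
        return self.active

    def force(self, nid):
        """Stand-in for a manual failover performed by the administrator."""
        self.current = self.nids.index(nid)
        return self.active


if __name__ == "__main__":
    ib, eth = "10.10.0.11@o2ib", "192.168.0.11@tcp"  # hypothetical NIDs
    conn = PeerConnection([ib, eth])
    conn.on_failure()        # IB fails, the peer moves to Ethernet
    conn.on_recovery(ib)     # IB comes back, the peer stays on Ethernet
    print(conn.active)
    conn.force(ib)           # only a manual action returns it to IB
    print(conn.active)
```

The point of the model is the empty on_recovery(): once traffic has moved to
the Ethernet NID it stays there until that connection fails or someone
intervenes.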


> Are there plans to address it in the future?

Sort of. Among the projects listed here http://wiki.lustre.org/Projects is
the Multi-Rail LNet project I'm working on, and if we can do everything we
want then (limited) failover functionality will be added to LNet. See
http://wiki.lustre.org/Multi-Rail_LNet for more details.

> Bonding will, of course, work for Ethernet devices.
>
> It seems that with dual OFED devices it is supported with the "ko2iblnd
> dev_failover=1" option. Perhaps this might work with Soft-RoCE?
>
> Could routing be used to solve this problem?

Routing can help, but LNet routes between LNet networks, not within them.
And it only uses routes if there is no direct connection at all. So if your
current setup is all direct-connected, you end up doubling all network
latencies by putting in LNet routers, just to be able to use routing.


> Cheers,
> Hans Henrik

As a final note: a common failure mode for failover setups is that they end
up falling apart when an actual failover occurs, because the remaining
systems or infrastructure cannot handle the load. In your case, when failing
over from IB to Ethernet you could end up with the cluster falling apart due
to network timeouts caused by high load or congestion. This kind of failure
typically doesn't show up during a test, instead it tends to happen when an
unplanned failover happens during production and you really depended on it
to just work.


Olaf

--
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Sr Software Engineer       3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  [email protected]