Thanks for the info. A few observations I found so far:

- I think LU-10297 has solved my stability issues.
- lustre.conf does work with comma separation of interfaces, i.e. o2ib(ib0,ib1). However, peers need to be configured with ldev.conf or lnetctl.
- Defining peering ('lnetctl peer add' and ARP settings) on the client only seems to make multi-rail work both ways.

I'm a bit puzzled by the last observation. I had expected that both ends would need to define peers. The client NID does not show as multi-rail ('lnetctl peer show') on the server.
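
For reference, a minimal sketch of the client-only peer setup described above. The NIDs are hypothetical placeholders, not the real addresses, and the script only prints the commands unless APPLY=1 is set:

```shell
# Assumes the local interfaces are already configured through the module
# option  options lnet networks="o2ib(ib0,ib1)"  in lustre.conf.

# Hypothetical server NIDs; replace with the real ones.
SERVER_PRIM_NID="10.0.0.1@o2ib"
SERVER_EXTRA_NIDS="10.0.1.1@o2ib"

# Print each command by default; set APPLY=1 (as root) to run it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Declare the server as a multi-rail peer on the client.
run lnetctl peer add --prim_nid "$SERVER_PRIM_NID" --nid "$SERVER_EXTRA_NIDS"

# Check what LNet now knows about its peers.
run lnetctl peer show
```

Running 'lnetctl peer show' on the server as well is how the missing multi-rail flag for the client NID can be observed.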

Cheers,
Hans Henrik

On 14-03-2018 03:00, Riccardo Veraldi wrote:
It works for me, but you have to set up lnet.conf correctly, either
manually or by using lnetctl to add peers. Then you export your
configuration to lnet.conf
and it will be loaded at reboot. I had to add my peers manually; I think
peer auto discovery is not yet operational in 2.10.3.
I suppose you are no longer using lustre.conf to configure interfaces
(ib, tcp) and that you are using the new Lustre DLC style:

http://wiki.lustre.org/Dynamic_LNET_Configuration
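
A rough sketch of that DLC workflow, assuming two local interfaces named ib0 and ib1 and /etc/lnet.conf as the export target (both are assumptions, adjust for your system). By default it only prints the commands; set APPLY=1 to run them:

```shell
# Print each command by default; set APPLY=1 (as root) to run it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Bring up LNet without any module-parameter networks.
run lnetctl lnet configure

# Add one o2ib network spanning both interfaces (names are assumptions).
run lnetctl net add --net o2ib --if ib0,ib1

# After adding peers with 'lnetctl peer add', save the running
# configuration so that it can be loaded again at boot.
run lnetctl export /etc/lnet.conf
```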

Also I do not know if you did this yet but you should configure ARP
settings and also rt_tables for your ib interfaces if you use multi-rail.
Here is an example. I had to do that to have things working properly:

https://wiki.hpdd.intel.com/display/LNet/MR+Cluster+Setup
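
A sketch of what such ARP and routing settings could look like. The sysctl values, addresses, subnet and table numbers below are assumptions for illustration, so please check them against the page above before using them. Numeric routing table IDs are used here; named tables would need entries in /etc/iproute2/rt_tables. The script prints the commands unless APPLY=1 is set:

```shell
# Print each command by default; set APPLY=1 (as root) to run it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Hypothetical subnet shared by both IB interfaces.
SUBNET="192.168.0.0/16"

# Per-interface ARP behaviour plus a source-based routing table.
setup_if() {
    ifname=$1; addr=$2; table=$3
    run sysctl -w "net.ipv4.conf.$ifname.arp_ignore=1"
    run sysctl -w "net.ipv4.conf.$ifname.arp_filter=0"
    run sysctl -w "net.ipv4.conf.$ifname.arp_announce=2"
    run sysctl -w "net.ipv4.conf.$ifname.rp_filter=0"
    run ip route add "$SUBNET" dev "$ifname" proto kernel scope link src "$addr" table "$table"
    run ip rule add from "$addr" table "$table"
}

run sysctl -w net.ipv4.conf.all.rp_filter=0
setup_if ib0 192.168.10.1 100   # hypothetical address and table ID
setup_if ib1 192.168.10.2 101   # hypothetical address and table ID
run ip route flush cache
```

The idea is that each interface answers ARP only for its own address and that replies leave through the interface owning the source address.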

You may also want to check that your IB interfaces (if you have a
dual-port InfiniBand card like I have) can really double the performance
when you enable both of them.
The PCIe bandwidth of the InfiniBand card has to be high enough to feed
traffic to both ports, or the second port will only be useful for failover,
without improving the speed as you may want.

In my configuration failover is working. If I disconnect one port, the
other will still work. Of course, if you disconnect it while traffic is
going through, you may have a problem with that stream of data. But new
traffic will be handled correctly. I do not know if there is a way to
avoid this; I am just talking about my experience and, as I said, I am
more interested in performance than failover.


Riccardo


On 3/13/18 8:05 AM, Hans Henrik Happe wrote:
Hi,

I'm testing LNET multi-rail with 2.10.3 and I ran into some questions
that I couldn't find in the documentation or elsewhere.

As I understand the design document, "Dynamic peer discovery" will make
it possible to discover multi-rail peers without adding them manually.
Is that functionality in 2.10.3?

Will failover work without doing anything special? I've tested with
two IB ports and unplugging resulted in no I/O from client and
replugging didn't resolve it.

How do I make an active/passive setup? One example I would really
like to see in the documentation is the obvious o2ib-tcp combination,
where tcp is used if o2ib is down and fails back if it comes up again.

Is anyone using MR in production? I've done a bit of testing with dual IB
on both server and client and had a few crashes.

Cheers,
Hans Henrik
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


