FYI, Multi-Rail in 2.12 will round-robin traffic across both the @tcp and
@o2ib networks (assuming peers are reachable on both). If @o2ib flakes out,
traffic should shift entirely to @tcp, but there isn't a way to specify that
traffic should go to @tcp only when there's a problem with @o2ib. You need the
user-defined selection policy feature for that, and that feature is not slated
to arrive until after 2.14 (afaik).
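
To make the behavior above concrete, here is a toy Python model. It is only a
sketch and not LNet code: the function name pick_networks and the idea of
passing in a list of "healthy" networks are made up for illustration, and the
real selection logic in LNet is more involved.

```python
from itertools import cycle


def pick_networks(healthy, n_messages):
    """Toy model: round-robin messages over the currently healthy networks.

    `healthy` is a hypothetical list of network names; it is not an LNet API.
    """
    if not healthy:
        raise RuntimeError("no healthy network to send on")
    rr = cycle(healthy)
    return [next(rr) for _ in range(n_messages)]


# Both networks up: traffic alternates, with no preference for o2ib.
print(pick_networks(["o2ib", "tcp"], 4))  # ['o2ib', 'tcp', 'o2ib', 'tcp']

# o2ib marked unhealthy: everything shifts to tcp.
print(pick_networks(["tcp"], 4))  # ['tcp', 'tcp', 'tcp', 'tcp']
```

Note what the model cannot express: "use @o2ib whenever it is up and @tcp only
as a backup". That preference is what the selection policy feature is for.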

Chris Horn

From: lustre-discuss <[email protected]> on behalf of 
"Spitz, Cory James" <[email protected]>
Date: Thursday, February 11, 2021 at 3:17 PM
To: "[email protected]" <[email protected]>, Lustre User Discussion 
Mailing List <[email protected]>
Subject: Re: [lustre-discuss] LNET IB intermittent connection
Resent-From: <[email protected]>
Resent-Date: Thursday, February 11, 2021 at 3:17 PM

Hi, Nate.

You asked, “can LNET be easily configured to go over the @tcp connection when 
the @o2ib flakes out?”

Yes, you can use LNet Multi-Rail for it and that _is_ covered in the “fine 
manual”, chapter 16 ☺
https://doc.lustre.org/lustre_manual.xhtml#lnetmr

-Cory

On 2/10/21, 4:54 PM, "lustre-discuss" <[email protected]> 
wrote:

Hi All,

  I've recently been having a bunch of LNET over Infiniband 
connection-lost/-restored errors and am trying to find the cause and/or tune 
the system to better cope. There is a lot of stuff on the wiki
(https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency),
but that's from 2016, and I don't know what parts are superseded. I'm 
currently running Lustre 2.12.5 on CentOS 7.8, with a mix of QLogic/Intel QDR 
and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).

  Is there a better place to look (e.g. the fine manual, section X) for 
guidance? I've done a few searches on the Jira, but the most similar errors 
should have already been fixed in earlier releases.

  Assuming that there is actually some impending hardware issue, can LNET be 
easily configured to go over the @tcp connection when the @o2ib flakes out?

Thanks,
Nate

--

Dr. Nathan Crawford              [email protected]
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall                 Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
