Hi Chris and Cory,

I remember looking at configuring multi-rail when 2.12 came out for this very reason, but stopped when it looked like round-robin only. Is there a way to trick the LNet Health system into seeing one interface as "sick but not dead"?
Also, when is 2.14 coming out :)

For what it's worth, the client errors I'm trying to diagnose (only one client has them) are similar to:

[Thu Feb 11 15:51:24 2021] LustreError: 11-0: DFS-L-OST0003-osc-ffff9cd07c339000: operation ost_set_info to node 10.201.32.48@o2ib1 failed: rc = -107
[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000: Connection to DFS-L-OST0003 (at 10.201.32.48@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
[Thu Feb 11 15:51:24 2021] LustreError: 167-0: DFS-L-OST0003-osc-ffff9cd07c339000: This client was evicted by DFS-L-OST0003; in progress operations using this service will fail.
[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000: Connection restored to 10.201.32.48@o2ib1 (at 10.201.32.48@o2ib1)

Thanks,
Nate

On Thu, Feb 11, 2021 at 1:25 PM Horn, Chris <[email protected]> wrote:

> FYI, multi-rail in 2.12 will round robin traffic between both @tcp and @o2ib networks. If @o2ib flakes out then traffic should shift entirely to @tcp, but there isn’t a way to specify that traffic go to @tcp only when there’s a problem with @o2ib. You need the user defined selection policy feature for that, and that feature is not slated to arrive until after 2.14 (afaik).
>
> Chris Horn
>
> *From: *lustre-discuss <[email protected]> on behalf of "Spitz, Cory James" <[email protected]>
> *Date: *Thursday, February 11, 2021 at 3:17 PM
> *To: *"[email protected]" <[email protected]>, Lustre User Discussion Mailing List <[email protected]>
> *Subject: *Re: [lustre-discuss] LNET IB intermittent connection
> *Resent-From: *<[email protected]>
> *Resent-Date: *Thursday, February 11, 2021 at 3:17 PM
>
> Hi, Nate.
>
> You asked, “can LNET be easily configured to go over the @tcp connection when the @o2ib flakes out?”
>
> Yes, you can use LNet Multi-Rail for it and that _*is*_ covered in the “fine manual”, chapter 16 ☺
>
> https://doc.lustre.org/lustre_manual.xhtml#lnetmr
>
> -Cory
>
> On 2/10/21, 4:54 PM, "lustre-discuss" <[email protected]> wrote:
>
> Hi All,
>
> I've recently been having a bunch of LNET over Infiniband connection-lost/-restored errors and am trying to find the cause and/or tune the system to better cope. There is a lot of stuff on the wiki (https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency), but that's from 2016, and I don't know what parts are superseded. I'm currently running Lustre 2.12.5 on CentOS 7.8, with a mix of Q-Logic/Intel QDR and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).
>
> Is there a better place to look (e.g. the fine manual, section X) for guidance? I've done a few searches on the Jira, but the most similar errors should have already been fixed in earlier releases.
>
> Assuming that there is actually some impending hardware issue, can LNET be easily configured to go over the @tcp connection when the @o2ib flakes out?
>
> Thanks,
>
> Nate
>
> --
> Dr. Nathan Crawford [email protected]
> Director of Scientific Computing
> School of Physical Sciences
> 164 Rowland Hall Office: 2101 Natural Sciences II
> University of California, Irvine Phone: 949-824-4508
> Irvine, CA 92697-2025, USA

--
Dr. Nathan Crawford [email protected]
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 2101 Natural Sciences II
University of California, Irvine Phone: 949-824-4508
Irvine, CA 92697-2025, USA
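
When chasing intermittent errors like the ones quoted in this message, it can help to tally the lost / evicted / restored events per OSC device, to see whether one target or one server NID dominates. Below is a minimal Python sketch of that idea. The names LINE_RE, classify and summarize are hypothetical helpers, not part of Lustre or any Lustre tool, and the pattern is derived only from the four log lines quoted in this message, so other Lustre message formats would probably need adjustments.

```python
import re
from collections import Counter

# The four client log lines quoted in the message, used as sample input.
SAMPLE = [
    "[Thu Feb 11 15:51:24 2021] LustreError: 11-0: DFS-L-OST0003-osc-ffff9cd07c339000: "
    "operation ost_set_info to node 10.201.32.48@o2ib1 failed: rc = -107",
    "[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000: "
    "Connection to DFS-L-OST0003 (at 10.201.32.48@o2ib1) was lost; "
    "in progress operations using this service will wait for recovery to complete",
    "[Thu Feb 11 15:51:24 2021] LustreError: 167-0: DFS-L-OST0003-osc-ffff9cd07c339000: "
    "This client was evicted by DFS-L-OST0003; "
    "in progress operations using this service will fail.",
    "[Thu Feb 11 15:51:24 2021] Lustre: DFS-L-OST0003-osc-ffff9cd07c339000: "
    "Connection restored to 10.201.32.48@o2ib1 (at 10.201.32.48@o2ib1)",
]

# Assumed line layout (inferred from the sample only):
# [timestamp] Lustre|LustreError: [NNN-N:] device: message text
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+"
    r"(?P<level>LustreError|Lustre):\s+"
    r"(?:(?P<code>\d+-\d+):\s+)?"
    r"(?P<device>[\w-]+):\s+"
    r"(?P<text>.*)$"
)


def classify(text):
    """Map a message body to a coarse event type (hypothetical categories)."""
    if "was lost" in text:
        return "lost"
    if "was evicted" in text:
        return "evicted"
    if "Connection restored" in text:
        return "restored"
    if re.search(r"operation \S+ to node \S+ failed: rc = -?\d+", text):
        return "op_failed"
    return "other"


def summarize(lines):
    """Count events per (device, event type); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        counts[(match.group("device"), classify(match.group("text")))] += 1
    return counts


if __name__ == "__main__":
    for (device, event), n in sorted(summarize(SAMPLE).items()):
        print(f"{device:40s} {event:10s} {n}")
```

In practice the input would come from something like the saved output of dmesg -T on the affected client rather than the SAMPLE list, and comparing the per-device counts against a healthy client should show whether the problem follows one OST, one server NID, or the client itself.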
