Hi Nathan,

Have you examined the underlying fabric to ensure it's functioning correctly?
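
As a quick first pass before reaching for the vendor tools, something like the sketch below could be run on each node to see whether any port error counters are nonzero. It assumes the HCA driver exposes the standard Linux RDMA sysfs counters under /sys/class/infiniband/<device>/ports/<port>/counters/, which may not hold for every driver. The function name read_port_errors and the choice of counters are mine, for illustration only, and a nonzero value is only a hint about where to look, not a diagnosis.

```python
from pathlib import Path

# Counters that often point at link or cable trouble. The names follow the
# Linux RDMA sysfs ABI; the selection itself is an assumption for this sketch.
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "local_link_integrity_errors",
)


def read_port_errors(root="/sys/class/infiniband"):
    """Return {(device, port, counter): value} for every nonzero error counter."""
    nonzero = {}
    base = Path(root)
    if not base.is_dir():
        # No RDMA devices visible (or a different sysfs layout): nothing to report.
        return nonzero
    for counters_dir in sorted(base.glob("*/ports/*/counters")):
        # Path layout: <root>/<device>/ports/<port>/counters
        device = counters_dir.parts[-4]
        port = counters_dir.parts[-2]
        for name in ERROR_COUNTERS:
            try:
                value = int((counters_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing, unreadable, or not a plain integer: skip it.
                continue
            if value:
                nonzero[(device, port, name)] = value
    return nonzero


if __name__ == "__main__":
    for (device, port, name), value in read_port_errors().items():
        print(f"{device} port {port}: {name} = {value}")
```

Running it twice a few minutes apart and comparing the values shows whether a counter is still climbing, which matters more than a stale total left over from an old event.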
https://www.mellanox.com/products/adapter-software/infiniband-management-and-monitoring-tools might interest you.

-cf

On Wed, Feb 10, 2021 at 3:54 PM Nathan Crawford <[email protected]> wrote:

> Hi All,
>
> I've recently been having a bunch of LNET over Infiniband
> connection-lost/-restored errors and am trying to find the cause and/or
> tune the system to better cope. There is a lot of stuff on the wiki (
> https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency),
> but that's from 2016, and I don't know what parts are superseded. I'm
> currently running Lustre 2.12.5 on CentOS 7.8, with a mix of Q-Logic/Intel
> QDR and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).
>
> Is there a better place to look (e.g. the fine manual, section X) for
> guidance? I've done a few searches on the Jira, but the most similar errors
> should have already been fixed in earlier releases.
>
> Assuming that there is actually some impending hardware issue, can LNET
> be easily configured to go over the @tcp connection when the @o2ib flakes
> out?
>
> Thanks,
> Nate
>
> --
>
> Dr. Nathan Crawford [email protected]
> Director of Scientific Computing
> School of Physical Sciences
> 164 Rowland Hall Office: 2101 Natural Sciences II
> University of California, Irvine Phone: 949-824-4508
> Irvine, CA 92697-2025, USA
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
