Hi All,

I've recently been seeing a lot of LNET over InfiniBand connection lost/restored errors, and I'm trying to find the cause and/or tune the system to cope better. There is a lot of material on the wiki (https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency), but that's from 2016, and I don't know which parts have been superseded. I'm currently running Lustre 2.12.5 on CentOS 7.8, with a mix of QLogic/Intel QDR and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).

Is there a better place to look (e.g. the fine manual, section X) for guidance? I've done a few searches on Jira, but the most similar errors I found should already have been fixed in earlier releases. Assuming there really is some impending hardware issue, can LNET easily be configured to fall back to the @tcp connection when the @o2ib one flakes out?

Thanks,
Nate

--
Dr. Nathan Crawford [email protected]
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 2101 Natural Sciences II
University of California, Irvine Phone: 949-824-4508
Irvine, CA 92697-2025, USA
