Hi Colin,

I've checked the performance/error counters and run the in-OS-repo version of ibdiagnet. Apart from a couple of nodes with known failing cables/HCAs (not involved in the LNET connection problems), the fabric was healthy. It did pick up that the IPoIB partition was still at 20 Gbit/s from when we had a couple of DDR connections, so increasing that to 40 may help.
The current suspicion is that the ZFS pools under the OSTs recently got much too close to capacity (>90%) and are taking longer to process I/O. Is there a set of timeouts to increase or thresholds to loosen in order to cope?

Thanks,
Nate

On Wed, Feb 10, 2021 at 3:24 PM Colin Faber <[email protected]> wrote:

> Hi Nathan,
>
> Have you examined the underlying fabric to ensure it's functioning
> correctly?
>
> https://www.mellanox.com/products/adapter-software/infiniband-management-and-monitoring-tools
> might interest you
>
> -cf
>
> On Wed, Feb 10, 2021 at 3:54 PM Nathan Crawford <[email protected]> wrote:
>
>> Hi All,
>>
>> I've recently been having a bunch of LNET over Infiniband
>> connection-lost/-restored errors and am trying to find the cause and/or
>> tune the system to better cope. There is a lot of stuff on the wiki (
>> https://wiki.lustre.org/Lustre_Resiliency:_Understanding_Lustre_Message_Loss_and_Tuning_for_Resiliency),
>> but that's from 2016, and I don't know what parts are superseded. I'm
>> currently running Lustre 2.12.5 on CentOS 7.8, with a mix of Q-Logic/Intel
>> QDR and Mellanox EDR HCAs and switches (using CentOS in-box RDMA/opensm).
>>
>> Is there a better place to look (e.g. the fine manual, section X) for
>> guidance? I've done a few searches on the Jira, but the most similar errors
>> should have already been fixed in earlier releases.
>>
>> Assuming that there is actually some impending hardware issue, can LNET
>> be easily configured to go over the @tcp connection when the @o2ib flakes
>> out?
>>
>> Thanks,
>> Nate
>>
>> --
>>
>> Dr. Nathan Crawford [email protected]
>> Director of Scientific Computing
>> School of Physical Sciences
>> 164 Rowland Hall Office: 2101 Natural Sciences II
>> University of California, Irvine Phone: 949-824-4508
>> Irvine, CA 92697-2025, USA
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>

--
Dr. Nathan Crawford [email protected]
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 2101 Natural Sciences II
University of California, Irvine Phone: 949-824-4508
Irvine, CA 92697-2025, USA
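
P.S. For the ZFS capacity suspicion above, here is a minimal sketch of how the fill level of each pool could be checked on an OSS. It assumes `zpool list -H -o name,capacity` is available there and prints tab-separated lines such as `ost0pool<TAB>93%`. The 90% threshold comes from the figure mentioned above; the function names and the pool names in the usage note are hypothetical, chosen only for illustration.

```python
import subprocess

# Threshold taken from the ">90%" figure in the message; adjust as needed.
THRESHOLD_PCT = 90


def parse_zpool_capacity(text):
    """Parse `zpool list -H -o name,capacity` output into {pool: percent}.

    Lines that do not have exactly two fields, or whose capacity is not
    a number (for example "-"), are skipped.
    """
    pools = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        name, cap = fields
        cap = cap.rstrip("%")
        if cap.isdigit():
            pools[name] = int(cap)
    return pools


def pools_over(pools, threshold=THRESHOLD_PCT):
    """Return the sorted names of pools strictly above the threshold."""
    return sorted(name for name, pct in pools.items() if pct > threshold)


def read_zpool_capacity():
    """Run zpool on the local host and return {pool: percent}."""
    out = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,capacity"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_zpool_capacity(out)
```

Calling `pools_over(read_zpool_capacity())` on each OSS would list the pools above the threshold. The parsing is kept separate from the `zpool` call so it can be tried on pasted output, e.g. `parse_zpool_capacity("ost0pool\t93%\nost1pool\t71%\n")`.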
