Quite strangely, I found 2 good hosts (both successfully mount the file system) where the TCP ping goes through on one but does not on the other (though LNET ping is OK for both).

- Youssef

On Wed, Jun 21, 2023 at 6:08 PM Youssef Eldakar <[email protected]> wrote:

> Thanks, Rick, for that suggestion. TCP ping between a problematic host and
> the MDS indeed does not go through.
>
> Not exactly sure what to investigate next, but that gives me somewhere to
> start...
>
> - Youssef
>
> On Tue, Jun 20, 2023 at 7:00 PM Mohr, Rick via lustre-discuss <
> [email protected]> wrote:
>
>> Have you tried tcp pings on the IP addresses associated with the IB
>> interfaces?
>>
>> --Rick
>>
>>
>> On 6/20/23, 12:11 PM, "lustre-discuss on behalf of Youssef Eldakar via
>> lustre-discuss" <[email protected] on behalf of
>> [email protected]> wrote:
>>
>>
>> In a cluster having ~100 Lustre clients (compute nodes) connected
>> together with the MDS and OSS over Intel True Scale InfiniBand
>> (discontinued product), we started seeing certain nodes failing to mount
>> the Lustre file system and giving I/O error on LNET (lctl) ping even though
>> an ibping test to the MDS gives no errors. We tried rebooting the
>> problematic nodes and even fresh-installing the OS and Lustre client, which
>> did not help. However, rebooting the MDS seems to possibly momentarily help
>> after the MDS starts up again, but the same set of problematic nodes seem
>> to always eventually revert back to the state where they fail to ping the
>> MDS over LNET.
>>
>>
>> Thank you for any pointers we may pursue.
>>
>>
>> Youssef Eldakar
>> Bibliotheca Alexandrina
>> www.bibalex.org
>> hpc.bibalex.org
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
