Thanks, Rick, for that suggestion. TCP ping between a problematic host and the MDS indeed does not go through.
Not exactly sure what to investigate next, but that gives me somewhere to start...

- Youssef

On Tue, Jun 20, 2023 at 7:00 PM Mohr, Rick via lustre-discuss <[email protected]> wrote:

> Have you tried tcp pings on the IP addresses associated with the IB
> interfaces?
>
> --Rick
>
>
> On 6/20/23, 12:11 PM, "lustre-discuss on behalf of Youssef Eldakar via
> lustre-discuss" <[email protected] on behalf of [email protected]> wrote:
>
>
> In a cluster having ~100 Lustre clients (compute nodes) connected together
> with the MDS and OSS over Intel True Scale InfiniBand (discontinued
> product), we started seeing certain nodes failing to mount the Lustre file
> system and giving I/O error on LNET (lctl) ping even though an ibping test
> to the MDS gives no errors. We tried rebooting the problematic nodes and
> even fresh-installing the OS and Lustre client, which did not help.
> However, rebooting the MDS seems to possibly momentarily help after the MDS
> starts up again, but the same set of problematic nodes seem to always
> eventually revert back to the state where they fail to ping the MDS over
> LNET.
>
>
> Thank you for any pointers we may pursue.
>
>
> Youssef Eldakar
> Bibliotheca Alexandrina
> www.bibalex.org
> hpc.bibalex.org
>
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
