Have you tried tcp pings on the IP addresses associated with the IB interfaces?
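
One way to run that comparison is the rough Python sketch below. It assumes "tcp ping" here means an ordinary `ping` of the IPoIB address, and that the LNet NID is that same address with an `@o2ib` suffix; adjust both if your setup differs. The helper names (`run_ok`, `classify`, `check_server`) are made up for illustration, and the only external commands used are `ping` and `lctl ping`.

```python
import subprocess


def run_ok(cmd, timeout=10):
    """Return True if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hung: treat as a failure.
        return False
    return result.returncode == 0


def classify(ip_ok, lnet_ok):
    """Turn the two ping results into a short human-readable verdict."""
    if ip_ok and lnet_ok:
        return "ok"
    if ip_ok:
        return "IP reachable, LNet ping fails"
    if lnet_ok:
        return "LNet ping works, IP ping fails"
    return "neither IP nor LNet ping works"


def check_server(ipoib_addr, lnet_type="o2ib", runner=run_ok):
    """Ping a server over IP and over LNet from the local node.

    ipoib_addr: IP address bound to the server's IB interface.
    lnet_type:  LNet network name; "o2ib" is an assumption, not taken
                from the original report.
    """
    nid = f"{ipoib_addr}@{lnet_type}"
    ip_ok = runner(["ping", "-c", "3", "-W", "2", ipoib_addr])
    lnet_ok = runner(["lctl", "ping", nid])
    return nid, classify(ip_ok, lnet_ok)
```

Run on a failing client, something like `print(check_server("192.0.2.10"))` (a placeholder address, to be replaced with the MDS's IPoIB address) would show whether the IP layer on the IB interface is affected too, or only LNet.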

--Rick


On 6/20/23, 12:11 PM, "lustre-discuss on behalf of Youssef Eldakar via 
lustre-discuss" <[email protected] on behalf of [email protected]> wrote:


In a cluster of ~100 Lustre clients (compute nodes) connected to the MDS and 
OSS over Intel True Scale InfiniBand (a discontinued product), we started 
seeing certain nodes fail to mount the Lustre file system and report an I/O 
error on LNET (lctl) ping, even though an ibping test to the MDS gives no 
errors. We tried rebooting the problematic nodes and even fresh-installing the 
OS and Lustre client, which did not help. Rebooting the MDS seems to help 
briefly once it is back up, but the same set of problematic nodes always 
eventually reverts to the state where they fail to ping the MDS over LNET.


Thank you for any pointers we may pursue.




Youssef Eldakar
Bibliotheca Alexandrina
www.bibalex.org
hpc.bibalex.org





_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
