Have you run ibdiagnet? You will also want to run ibqueryerrors.

On Tue, 20 Jun 2023, 17:11 Youssef Eldakar via lustre-discuss, <[email protected]> wrote:
> In a cluster having ~100 Lustre clients (compute nodes) connected together
> with the MDS and OSS over Intel True Scale InfiniBand (discontinued
> product), we started seeing certain nodes failing to mount the Lustre file
> system and giving I/O error on LNET (lctl) ping even though an ibping test
> to the MDS gives no errors. We tried rebooting the problematic nodes and
> even fresh-installing the OS and Lustre client, which did not help.
> However, rebooting the MDS seems to possibly momentarily help after the MDS
> starts up again, but the same set of problematic nodes seem to always
> eventually revert back to the state where they fail to ping the MDS over
> LNET.
>
> Thank you for any pointers we may pursue.
>
> Youssef Eldakar
> Bibliotheca Alexandrina
> www.bibalex.org
> hpc.bibalex.org
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
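
For anyone wanting to collect these checks in one pass on an affected node, here is a minimal sketch that runs the fabric diagnostics next to the LNet ping and reports each exit code. The NID below is a made-up placeholder, the o2ib network name is an assumption about this setup, and the helper names (build_checks, run_checks) are only for illustration. The tools are invoked without options, so adjust to your site as needed.

```python
import shutil
import subprocess

# Hypothetical NID of the MDS (assumption): replace with the real one.
MDS_NID = "10.0.0.1@o2ib"


def build_checks(mds_nid):
    """Return (label, argv) pairs for the fabric and LNet checks."""
    return [
        ("fabric diagnostics", ["ibdiagnet"]),
        ("port error counters", ["ibqueryerrors"]),
        ("LNet ping to MDS", ["lctl", "ping", mds_nid]),
    ]


def run_checks(checks):
    """Run each check; map label to exit code, or None if the tool is missing."""
    results = {}
    for label, argv in checks:
        if shutil.which(argv[0]) is None:
            # Tool not found on PATH, so skip it rather than fail.
            results[label] = None
            continue
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[label] = proc.returncode
    return results


if __name__ == "__main__":
    for label, rc in run_checks(build_checks(MDS_NID)).items():
        status = "not installed" if rc is None else f"exit code {rc}"
        print(f"{label}: {status}")
```

Running this on both a working node and a failing node and comparing the results may help narrow down whether the problem sits in the fabric or in LNet.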
