We’ll give that a try if it happens again.
At least it is good to know that we aren’t the only ones having issues.

--
Jan van Haarst
HPC Administrator
For Anunna/HPC questions, please use https://support.wur.nl (with HPC as service)
Present: Monday, Tuesday, Thursday & Friday
Facilitair Bedrijf (Facilities Services), part of Wageningen University & Research
Information Technology Department
P.O. Box 59, 6700 AB, Wageningen
Building 116, Akkermaalsbos 12, 6700 WB, Wageningen
http://www.wur.nl/nl/Disclaimer.htm


From: lustre-discuss <[email protected]> on behalf of 
Hans Henrik Happe via lustre-discuss <[email protected]>
Date: Thursday, 29 August 2024 at 16:18
To: [email protected] <[email protected]>
Subject: Re: [lustre-discuss] How to activate an OST on a client ?

Hi,

We just had a similar issue on 2.15.5: InfiniBand clients were not reconnecting 
after a target outage.

Deleting the LNet net and importing the config again solved it without a reboot 
or unmount:

# lnetctl net del --net o2ib
# lnetctl import < /etc/lnet.conf
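
The two commands above could be wrapped as follows. This is only a sketch: the function name reset_lnet_net and the LNETCTL override (useful for a dry run with LNETCTL=echo) are hypothetical, and the default path /etc/lnet.conf is taken from the commands above. Deleting the net interrupts traffic on it, so it is probably wise to try this on a single client first.

```shell
#!/bin/sh
# Sketch: drop one LNet net and re-import the saved configuration.
# LNETCTL is a hypothetical override; set LNETCTL=echo to print the
# commands instead of running them.
LNETCTL=${LNETCTL:-lnetctl}

reset_lnet_net() {
    net=$1                       # LNet net to delete, e.g. o2ib
    conf=${2:-/etc/lnet.conf}    # configuration file to import afterwards
    [ -n "$net" ] || { echo "usage: reset_lnet_net <net> [conf]" >&2; return 2; }
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    # Stop here if the delete fails, so a half-applied state is noticed.
    "$LNETCTL" net del --net "$net" || return 1
    "$LNETCTL" import < "$conf"
}

# Example (as root, not run here):
#   reset_lnet_net o2ib /etc/lnet.conf
```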

Cheers,
Hans Henrik

On 28/08/2024 18.18, Lixin Liu via lustre-discuss wrote:
We had the same problem after we upgraded our Lustre servers from 2.12.8 to 2.15.3.
Clients were running 2.15.3 on CentOS 7. Random OSTs dropped out frequently on
busy login nodes (almost daily), but less often on compute nodes. The “lctl” command
could not activate the OSTs, and a reboot was the only way to clear the problem.

In June, we upgraded all client OSes to AlmaLinux 9.3 and Lustre to 2.15.4 on
both servers and clients (missing the 2.15.5 release by about 2 weeks). After the
upgrade, we no longer have this problem.

In our case, I wonder whether this was OmniPath related. The servers, on AlmaLinux 8,
were using the in-kernel driver, but the CentOS 7 clients were using the driver from
the Intel/Cornelis release. The Alma 9 clients now also use the in-kernel driver.

Cheers,

Lixin.

From: lustre-discuss <[email protected]> on behalf of
Cameron Harr via lustre-discuss <[email protected]>
Reply-To: Cameron Harr <[email protected]>
Date: Wednesday, August 28, 2024 at 8:19 AM
To: "[email protected]" <[email protected]>
Subject: Re: [lustre-discuss] How to activate an OST on a client ?


There's also an "lctl --device <dev> activate" that I've used in the past, 
though I don't know what conditions need to be met for it to work.

On 8/27/24 07:46, Andreas Dilger via lustre-discuss wrote:
Hi Jan,
There is "lctl --device XXXX recover", which will trigger a reconnect to the 
named OST device (per "lctl dl" output), but I'm not sure if that will help.
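
A minimal sketch of looking up the device number to pass to "lctl --device". It assumes "lctl dl" prints one device per line with the device number, state, type and name in the first four columns; the helper name osc_devno and the filesystem name "lustre" are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the device number(s) of the OSC device(s) for one OST,
# reading "lctl dl"-style lines on stdin.
# Assumed column layout: <devno> <state> <type> <name> <uuid> <refcount>
osc_devno() {
    # $1 = OST name as it starts the device name, e.g. lustre-OST0003
    awk -v ost="$1" '$3 == "osc" && index($4, ost "-") == 1 { print $1 }'
}

# Example on a client (as root, not run here):
#   for dev in $(lctl dl | osc_devno lustre-OST0003); do
#       lctl --device "$dev" recover
#   done
```

The same loop could call "lctl --device "$dev" activate" instead of "recover".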


Cheers, Andreas

On Aug 22, 2024, at 06:36, Haarst, Jan van via lustre-discuss 
<[email protected]> wrote:
Hi,

The wording of the subject probably doesn’t quite cover the issue. What we 
see is this:
We have a client behind a router (linking TCP to Omnipath) that shows an 
inactive OST (all on 2.15.5).
Other clients that go through the router do not have this issue.

One client had the same issue, although it showed a different OST as inactive.
After a reboot, all was well again on that machine.

The clients can lctl ping the OSSs.
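
For reference, one way to list which OSCs a client considers inactive. This is a sketch that assumes the client exposes an "osc.*.active" parameter printed as "osc.<device name>.active=<0|1>"; the helper name inactive_oscs is hypothetical, and the parameter name should be checked against the installed version.

```shell
#!/bin/sh
# Sketch: print the OSC device names whose "active" flag is 0,
# reading "lctl get_param osc.*.active"-style lines on stdin.
# Assumed line format: osc.<device name>.active=<0|1>
inactive_oscs() {
    sed -n 's/^osc\.\(.*\)\.active=0$/\1/p'
}

# Example on a client (not run here):
#   lctl get_param osc.*.active | inactive_oscs
```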

So although we have a workaround (reboot the client), it would be nice to:

  1.  Fix the issue without a reboot
  2.  Fix the underlying issue.

It might be unrelated, but we also see another routing issue every now and then:
The router stops routing requests toward a certain OSS, and this can be fixed by 
deleting the peer_nid of the OSS from the router.
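
A sketch of what that peer deletion might look like on the router, assuming it is done with "lnetctl peer del". The helper name drop_peer, the LNETCTL override and the example NID are hypothetical.

```shell
#!/bin/sh
# Sketch: show, then delete, the LNet peer entry for one OSS on the router.
# LNETCTL is a hypothetical override; set LNETCTL=echo for a dry run.
LNETCTL=${LNETCTL:-lnetctl}

drop_peer() {
    nid=$1    # primary NID of the OSS, e.g. 10.0.0.5@o2ib (made-up address)
    case $nid in
        *@*) ;;
        *) echo "usage: drop_peer <addr>@<net>" >&2; return 2 ;;
    esac
    # Print the current peer state first, so it can be saved for later analysis.
    "$LNETCTL" peer show --nid "$nid"
    "$LNETCTL" peer del --prim_nid "$nid"
}

# Example on the router (as root, not run here):
#   drop_peer 10.0.0.5@o2ib
```

Capturing the "peer show" output before the delete may help when comparing a stuck peer with a healthy one.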

I am probably missing informative logs, but I’m more than happy to try to 
generate them, if somebody has a pointer to how.

We are a bit stumped right now.

With kind regards,

--
Jan van Haarst
HPC Administrator

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

