Hi Horn,

Thanks for your help. I checked the description of the ticket you mentioned. The results are similar, but the root cause seems different.

While I was waiting for the account, I debugged this issue with my colleague and we found the root cause. We are already working on a fix, and hopefully we can provide a patch for this issue.

In short, it is a network namespace problem. A NIC named eth0 with index 2 can be created in another network namespace. LNet then thinks that the eth0 in the default namespace has changed, and its state becomes corrupted.
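To make the namespace mix-up concrete, here is a small Python model of it. This is only an illustrative sketch, not LNet code: the names `NetdevEvent`, `TrackedNI`, and both handler functions are hypothetical, and it assumes the event handler identifies a device by name and index alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetdevEvent:
    """Simplified stand-in for a kernel netdevice notification (hypothetical)."""
    netns: str    # namespace the device lives in
    name: str     # interface name, e.g. "eth0"
    ifindex: int  # interface index inside that namespace
    up: bool      # link state carried by the event


@dataclass
class TrackedNI:
    """The interface an LNet-like layer is bound to (hypothetical model)."""
    netns: str
    name: str
    ifindex: int
    status: str = "up"


def handle_event_by_name_and_index(ni: TrackedNI, ev: NetdevEvent) -> None:
    """Assumed faulty behaviour: match on name and index, ignore the namespace."""
    if ev.name == ni.name and ev.ifindex == ni.ifindex:
        ni.status = "up" if ev.up else "down"


def handle_event_namespace_aware(ni: TrackedNI, ev: NetdevEvent) -> None:
    """Same matching, but events from other namespaces are dropped first."""
    if ev.netns != ni.netns:
        return
    handle_event_by_name_and_index(ni, ev)


# A pod's own "eth0" (index 2 inside the pod namespace) reports link down.
pod_event = NetdevEvent(netns="pod-1", name="eth0", ifindex=2, up=False)

host_ni = TrackedNI(netns="default", name="eth0", ifindex=2)
handle_event_by_name_and_index(host_ni, pod_event)
print(host_ni.status)  # the host NI is wrongly marked down

host_ni = TrackedNI(netns="default", name="eth0", ifindex=2)
handle_event_namespace_aware(host_ni, pod_event)
print(host_ni.status)  # the host NI is left untouched
```

The only difference between the two handlers is the namespace check, which is the kind of check we think is missing. The actual patch may look quite different.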
Regards,
Chuanjun

On Thu, Mar 2, 2023 at 01:16, Horn, Chris <[email protected]> wrote:

> Hi CJ,
>
> I don’t know if you ever got an account and ticket opened, but I
> stumbled upon this change which sounds like it could be your issue -
> https://jira.whamcloud.com/browse/LU-16378
>
> commit 3c9282a67d73799a03cb1d254275685c1c1e4df2
> Author: Cyril Bordage [email protected]
> Date: Sat Dec 10 01:51:16 2022 +0100
>
>     LU-16378 lnet: handles unregister/register events
>
>     When network is restarted, devices are unregistered and then
>     registered again. When a device registers using an index that is
>     different from the previous one (before network was restarted), LNet
>     ignores it. Consequently, this device stays with link in fatal state.
>
>     To fix that, we catch unregistering events to clear the saved index
>     value, and when a registering event comes, we save the new value.
>
> Chris Horn
>
> *From: *CJ Yin <[email protected]>
> *Date: *Sunday, February 19, 2023 at 12:23 AM
> *To: *Horn, Chris <[email protected]>
> *Cc: *[email protected] <[email protected]>
> *Subject: *Re: [lustre-discuss] LNet nid down after some thing changed
> the NICs
>
> Hi Chris,
>
> Thanks for your help. I have collected the relevant logs according to
> your hints, but I need an account to open a ticket on Jira. I have sent
> an email to the administrator at [email protected]. I was wondering if
> this is the correct way to apply for an account. I only found this
> email on the site.
>
> Regards,
> Chuanjun
>
> On Sat, Feb 18, 2023 at 00:52, Horn, Chris <[email protected]> wrote:
>
> If deleting and re-adding it restores the status to up then this sounds
> like a bug to me.
>
> Can you enable debug tracing, reproduce the issue, and add this
> information to a ticket?
>
> To enable/gather debug:
>
> # lctl set_param debug=+net
> <reproduce issue>
> # lctl dk > /tmp/dk.log
>
> You can create a ticket at https://jira.whamcloud.com/
>
> Please provide the dk.log with the ticket.
>
> Thanks,
> Chris Horn
>
> *From: *lustre-discuss <[email protected]> on
> behalf of 腐朽银 via lustre-discuss <[email protected]>
> *Date: *Friday, February 17, 2023 at 2:53 AM
> *To: *[email protected] <[email protected]>
> *Subject: *[lustre-discuss] LNet nid down after some thing changed the
> NICs
>
> Hi,
>
> I encountered a problem when using the Lustre client on k8s with
> kubenet. I would be very happy if you could help me.
>
> My LNet configuration is:
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>     - net type: tcp
>       local NI(s):
>         - nid: 10.224.0.5@tcp
>           status: up
>           interfaces:
>               0: eth0
>
> It works, but after I deploy or delete a pod on the node, the nid goes
> down like:
>
>         - nid: 10.224.0.5@tcp
>           status: down
>           interfaces:
>               0: eth0
>
> k8s uses veth pairs, so it adds or deletes network interfaces when
> deploying or deleting pods, but it doesn't touch the eth0 NIC. I can
> fix it by deleting the tcp net with `lnetctl net del` and re-adding it
> with `lnetctl net add`, but I need to do this every time a pod is
> scheduled to this node.
>
> My node OS is Ubuntu 18.04 5.4.0-1101-azure. The Lustre client is built
> by myself from 2.15.1. Is this expected LNet behavior, or did I get
> something wrong? I rebuilt and tested it several times and got the same
> problem.
>
> Regards,
> Chuanjun
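
Until a fix lands, the manual check described in the quoted message (looking for a NID whose status has gone to down) can be scripted. The following is a rough sketch only: the function name `down_nids` is hypothetical, and it assumes the text has the same `nid:` / `status:` layout as the configuration quoted above.

```python
from typing import List


def down_nids(net_show_text: str) -> List[str]:
    """Return every NID whose following 'status:' line is not 'up'."""
    down = []
    current_nid = None
    for raw in net_show_text.splitlines():
        # Drop indentation and the leading "- " list marker, if any.
        line = raw.strip().lstrip("- ").strip()
        if line.startswith("nid:"):
            current_nid = line.split(":", 1)[1].strip()
        elif line.startswith("status:") and current_nid is not None:
            if line.split(":", 1)[1].strip() != "up":
                down.append(current_nid)
            current_nid = None
    return down


# Sample text copied from the configuration in this thread, with the tcp NID down.
SAMPLE = """\
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
    - net type: tcp
      local NI(s):
        - nid: 10.224.0.5@tcp
          status: down
          interfaces:
              0: eth0
"""

print(down_nids(SAMPLE))
```

A wrapper could feed this function the saved text of the net listing and, when it returns a NID, run the `lnetctl net del` / `lnetctl net add` workaround mentioned above. That part is left out here because it depends on the node's configuration.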
_______________________________________________ lustre-discuss mailing list [email protected] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
