Re: [lustre-discuss] LNet nid down after some thing changed the NICs

CJ Yin via lustre-discuss Thu, 09 Mar 2023 01:04:47 -0800

Hi Horn,

Thanks for your help. I checked the description of the ticket you
mentioned. The results are similar, but the root cause seems different.
While I was waiting for the account, I debugged this issue with my
colleague, and we found the root cause. We are already working on a fix
and hope to provide a patch for this issue. In short, it is a network
namespace problem: a NIC named eth0 with index 2 can be created in another
network namespace. LNet then thinks the eth0 in the default namespace has
changed, and its state for that interface becomes corrupted.
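
For anyone who wants to check whether a node is affected, the condition
can be sketched roughly as follows. This is an untested sketch, not our
exact reproduction: the namespace name "lnet-repro" and the use of a dummy
device are assumptions made for illustration, and it assumes the eth0 used
by LNet has index 2 in the default namespace. By default the commands are
only printed (DRY_RUN=1); running them for real needs root.

```shell
#!/bin/sh
# Sketch: create a second "eth0" with ifindex 2 in a separate network
# namespace, then look at the LNet NI status in the default namespace.
# NETNS is a made-up name; the dummy device type is an assumption.
NETNS="${NETNS:-lnet-repro}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro() {
    run ip netns add "$NETNS"
    # Same name and index as the real NIC, but in another namespace.
    run ip -n "$NETNS" link add eth0 index 2 type dummy
    # On an affected build the tcp NI may now report "status: down".
    run lnetctl net show --net tcp
    run ip netns del "$NETNS"
}
```

Source the file and call repro to see the commands; set DRY_RUN=0 (as
root) to actually run them.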


Regards,
Chuanjun

Horn, Chris <[email protected]> wrote on Thu, Mar 2, 2023 at 01:16:

> Hi CJ,
>
>
>
> I don’t know if you ever got an account and ticket opened, but I stumbled
> upon this change which sounds like it could be your issue -
> https://jira.whamcloud.com/browse/LU-16378
>
> commit 3c9282a67d73799a03cb1d254275685c1c1e4df2
> Author: Cyril Bordage [email protected]
> Date:   Sat Dec 10 01:51:16 2022 +0100
>
>     LU-16378 lnet: handles unregister/register events
>
>     When network is restarted, devices are unregistered and then
>     registered again. When a device registers using an index that is
>     different from the previous one (before network was restarted), LNet
>     ignores it. Consequently, this device stays with link in fatal state.
>
>     To fix that, we catch unregistering events to clear the saved index
>     value, and when a registering event comes, we save the new value.
>
>
>
> Chris Horn
>
>
>
> *From: *CJ Yin <[email protected]>
> *Date: *Sunday, February 19, 2023 at 12:23 AM
> *To: *Horn, Chris <[email protected]>
> *Cc: *[email protected] <[email protected]>
> *Subject: *Re: [lustre-discuss] LNet nid down after some thing changed
> the NICs
>
> Hi Chris,
>
>
>
> Thanks for your help. I have collected the relevant logs according to your
> hints. But I need an account to open a ticket on Jira. I have sent an
> email to the administrator at [email protected]. I was wondering if this
> is the correct way to apply for an account. I only found this email on the
> site.
>
>
>
> Regards,
>
> Chuanjun
>
>
>
> Horn, Chris <[email protected]> wrote on Sat, Feb 18, 2023 at 00:52:
>
> If deleting and re-adding it restores the status to up then this sounds
> like a bug to me.
>
>
>
> Can you enable debug tracing, reproduce the issue, and add this
> information to a ticket?
>
> To enable/gather debug:
>
> # lctl set_param debug=+net
> <reproduce issue>
> # lctl dk > /tmp/dk.log
>
> You can create a ticket at https://jira.whamcloud.com/
>
> Please provide the dk.log with the ticket.
>
>
>
> Thanks,
>
> Chris Horn
>
>
>
> *From: *lustre-discuss <[email protected]> on
> behalf of 腐朽银 via lustre-discuss <[email protected]>
> *Date: *Friday, February 17, 2023 at 2:53 AM
> *To: *[email protected] <[email protected]>
> *Subject: *[lustre-discuss] LNet nid down after some thing changed the
> NICs
>
> Hi,
>
>
>
> I encountered a problem when using the Lustre client on k8s with kubenet.
> I would be very grateful if you could help me.
>
>
>
> My LNet configuration is:
>
>
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>     - net type: tcp
>       local NI(s):
>         - nid: 10.224.0.5@tcp
>           status: up
>           interfaces:
>               0: eth0
>
>
>
> It works, but after I deploy or delete a pod on the node, the NID goes
> down:
>
>
>
>         - nid: 10.224.0.5@tcp
>           status: down
>           interfaces:
>               0: eth0
>
>
>
> k8s uses veth pairs, so it adds or deletes network interfaces when
> deploying or deleting pods, but it doesn't touch the eth0 NIC. I can fix
> the problem by deleting the tcp net with `lnetctl net del` and re-adding
> it with `lnetctl net add`, but I need to do this every time a pod is
> scheduled to this node.
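
The manual workaround described above can be sketched as a small script.
This is only a rough sketch under assumptions: the function names are
made up for illustration, the net is re-added with default parameters
(any extra tunables would be lost), and it has to run as root. The NID
and interface are the ones from the configuration quoted above.

```shell
#!/bin/sh
# Sketch: re-add the tcp net when a given NID reports "status: down".
# NID and IFACE default to the values from the configuration above.
NID="${NID:-10.224.0.5@tcp}"
IFACE="${IFACE:-eth0}"

# Read `lnetctl net show` output on stdin and print the status of NID ($1).
nid_status() {
    awk -v nid="$1" '
        $1 == "-" && $2 == "nid:" { found = ($3 == nid) }
        found && $1 == "status:"  { print $2; exit }
    '
}

# Delete and re-add the tcp net if the NID is down (needs root).
readd_if_down() {
    if [ "$(lnetctl net show --net tcp | nid_status "$NID")" = "down" ]; then
        lnetctl net del --net tcp
        lnetctl net add --net tcp --if "$IFACE"
    fi
}
```

Calling readd_if_down after each pod change (or periodically) would
replace the manual del/add, but it only hides the symptom.
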
>
>
>
> My node OS is Ubuntu 18.04 with kernel 5.4.0-1101-azure. I built the
> Lustre client myself from 2.15.1. Is this expected LNet behavior, or did
> I get something wrong? I rebuilt and tested it several times and got the
> same problem.
>
>
>
> Regards,
>
> Chuanjun
>
>

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
