You are absolutely right. The IB interface 172.21.164.116 is not registered, only the TCP one is:

- { index: 234, event: add_uuid, nid: 172.21.156.102@tcp1(0x20001ac159c66), node: 172.21.156.102@tcp1 }
- { index: 240, event: add_uuid, nid: 172.21.156.102@tcp1(0x20001ac159c66), node: 172.21.156.102@tcp1 }
- { index: 246, event: add_uuid, nid: 172.21.156.102@tcp1(0x20001ac159c66), node: 172.21.156.102@tcp1 }

Do you know how I can register it with the o2ib1 interface?

I already reformatted the OSTs, but that did not fix the problem.

Thanks

Riccardo



On 9/28/21 11:56 AM, Stephane Thiell wrote:
Hi Riccardo,

I would check if the OSTs on this OSS have been registered with the correct 
NIDs (o2ib1) on the MGS:

$ lctl --device MGS llog_print <fsname>-client

and look for the NIDs in setup/add_conn for the OSTs in question.
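If those entries show the tcp1 NID, one common way to re-register the targets is to regenerate the config logs with a writeconf. A rough sketch of the procedure (device paths below are placeholders, not your actual ones; unmount clients first, then all targets, before running this):

```shell
# On the MDS, with the MDT unmounted (regenerates the MGS/MDT config logs):
tunefs.lustre --writeconf /dev/mdt_device      # placeholder device path

# On the affected OSS, with the OST unmounted:
tunefs.lustre --writeconf /dev/ost_device      # placeholder device path

# Then remount in order: MGS/MDT first, then the OSTs, then the clients.
```

Note that a target registers whatever NIDs LNet reports at mount time, so make sure the o2ib1 interface is configured and up on that OSS before remounting the OST.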

Best,

Stephane



On Sep 28, 2021, at 9:52 AM, Riccardo Veraldi <[email protected]> 
wrote:

Hello.

I have a Lustre setup where the MDS (172.21.156.112) is on tcp1 while the 
OSSes are on o2ib1.

I am using Lustre 2.12.7 on RHEL 7.9.

All the clients can see the MDS correctly as a tcp1 peer:

peer:
     - primary nid: 172.21.156.112@tcp1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.156.112@tcp1
           state: NA


This is by design because the MDS has no IB interface. So MDS-to-OSS and 
MDS-to-client traffic is on tcp1, while client-to-OSS traffic is meant to be 
on o2ib1.

I have 1 MDS (tcp1) and 12 OSSes (tcp1, o2ib1), plus 20 clients 
(tcp1, o2ib1).

All is fine except for one of the OSSes (172.21.164.116@o2ib1, 
172.21.156.102@tcp1).

Even though it is configured the same as all the others, traffic only goes 
through tcp1 and not o2ib1.

Even if I force the peer settings to use o2ib1, that is ignored and the tcp1 
peer is added anyway.

This is lnet.conf on the MDS:

ip2nets:
  - net-spec: o2ib1
    interfaces:
       0: ib0
  - net-spec: tcp1
    interfaces:
       0: eno1
global:
     discovery: 0



This is lnet.conf on the OSSes:

ip2nets:
  - net-spec: o2ib1
    interfaces:
       0: ib0
  - net-spec: tcp1
    interfaces:
       0: enp1s0f0
global:
     discovery: 0
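For completeness, the way I apply this configuration on each node is along these lines (a sketch of my sequence, not a full recipe):

```shell
# Load LNet and apply the YAML configuration above
modprobe lnet
lnetctl lnet configure
lnetctl import /etc/lnet.conf

# Verify that both the o2ib1 and tcp1 local NIDs came up
lnetctl net show
```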



I also tried this on the lustre clients side:

peer:
     - primary nid: 172.21.164.116@o2ib1
       Multi-Rail: False
       peer ni:
         - nid: 172.21.164.116@o2ib1

enforcing the peer settings to o2ib1.

This is ignored and the peer is added via its tcp1 LNet interface:

     - primary nid: 172.21.156.102@tcp1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.156.102@tcp1
           state: NA
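The pinning I attempted was along these lines (a sketch; the YAML above is the result I was aiming for):

```shell
# Remove the unwanted tcp1 peer entry for that OSS
lnetctl peer del --prim_nid 172.21.156.102@tcp1

# Pin the peer by its o2ib1 NID instead
lnetctl peer add --prim_nid 172.21.164.116@o2ib1

# Check what actually got recorded
lnetctl peer show --nid 172.21.164.116@o2ib1
```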

All of the hosts involved have discovery set to 0.

Nevertheless, the peer setting for that specific OSS uses tcp1 and not o2ib1.

This is disruptive because traffic to that specific OSS goes over tcp1, which 
is of course slower than IB.

I had to deactivate the OSTs on that specific OSS.

How may I fix this issue?

Here is the complete peer list from the Lustre client side; as you can see, 
that specific OSS is included as a tcp1 peer.

Even if I do "lnetctl peer del --nid 172.21.156.102@tcp1 --prim_nid 
172.21.156.102@tcp1", the entry is re-added automatically after a while.

lnetctl peer show
peer:
     - primary nid: 172.21.156.112@tcp1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.156.112@tcp1
           state: NA
     - primary nid: 172.21.164.111@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.111@o2ib1
           state: NA
     - primary nid: 172.21.164.117@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.117@o2ib1
           state: NA
     - primary nid: 172.21.164.112@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.112@o2ib1
           state: NA
     - primary nid: 172.21.164.119@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.119@o2ib1
           state: NA
     - primary nid: 172.21.164.114@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.114@o2ib1
           state: NA
     - primary nid: 172.21.164.120@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.120@o2ib1
           state: NA
     - primary nid: 172.21.156.102@tcp1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.156.102@tcp1
           state: NA
     - primary nid: 172.21.164.116@o2ib1
       Multi-Rail: False
       peer ni:
         - nid: 172.21.164.116@o2ib1
           state: NA
     - primary nid: 172.21.164.110@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.110@o2ib1
           state: NA
     - primary nid: 172.21.164.115@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.115@o2ib1
           state: NA
     - primary nid: 172.21.164.118@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.118@o2ib1
           state: NA
     - primary nid: 172.21.164.113@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.113@o2ib1
           state: NA
     - primary nid: 172.21.164.121@o2ib1
       Multi-Rail: True
       peer ni:
         - nid: 172.21.164.121@o2ib1
           state: NA


thanks for looking at this.

Rick
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
