Hello Amir i answer in-line

On 3/5/19 3:42 PM, Amir Shehata wrote:
It looks like the ping is passing. Did you try it several times to make sure it always pings successfully?

The way it works is the MDS (2.12) discovers all the interfaces on the peer. There is a concept of the primary NID for the peer. That's the first interface configured on the peer. In your case it's the o2ib NID. So when you do lnetctl net show you'll see Primary NID: <nid>@o2ib.

    - primary nid: 172.21.52.88@o2ib
       Multi-Rail: True
       peer ni:
         - nid: 172.21.48.250@tcp
           state: NA
         - nid: 172.21.52.88@o2ib
           state: NA
         - nid: 172.21.48.250@tcp1
           state: NA
         - nid: 172.21.48.250@tcp2
           state: NA

On the MDS it uses the primary_nid to identify the peer. So you can ping using the Primary NID. LNet will resolve the Primary NID to the tcp NID. As you can see in the logs, it never actually talks over o2ib. It ends up talking to the peer on its TCP NID, which is what you want to do.

I think the problem you're seeing is caused by the combination of 2.12 and 2.10.x.
From what I understand your servers are 2.12 and your clients are 2.10.x.
my clients are 2.10.5 but this problem arise also with one client 2.12.0, anyway the combination of 2.10.0 clients and 2.12.0 is not working right

Can you try disabling dynamic discovery on your servers:
lnetctl set discovery 0

I did this on the MDS and OSS. I did not disable discovery on the client side.

now on the MDS side lnetctl peer show looks right.

Anyway on the client side where I have both IB and tcp if I write on the lustre filesystem (OSS) what hapens is that the write operation is splitte/load balanced between IB and tcp (Ethernet) and I do not want this. I would like that only IB would be used when the client writes data to the OSS. but both peer ni (o2ib,tcp) are seen from the 2.12.0 client and traffic goes to both of them thus reducing performances because IB is not fully used. This does not happen with 2.10.5 client writing on the same 2.12.0 OSS



Do that as part of the initial bring up to make sure 2.12 nodes don't try to discover peers. Let me know if that resolves your issue?

On Tue, 5 Mar 2019 at 15:09, Riccardo Veraldi <[email protected] <mailto:[email protected]>> wrote:

    it is not exactly this problem.

    here is my setup

      * MDS is on tcp0
      * client is on tcp0 and o2ib0
      * OSS is on tcp0 and o2ib0

    The problem is that the MDS is discovering both the lustre client
    and the OSS as well over o2ib and it should not because the MDS
    has only one ethernet interface. I can see this from lnetctl peer
    show. This did not happen prior to upgrading to Lustre 2.12.0 from
    2.10.5

    so I tried to debug with lctl ping from the MDS to the lustre client

    [root@psmdsana1501 ~]# lctl ping 172.21.48.250@tcp
    12345-0@lo
    12345-172.21.52.88@o2ib
    12345-172.21.48.250@tcp

    172.21.52.88 is the ib interface of the lustre client.

    so I did

    [root@psmdsana1501 ~]# lctl ping 172.21.52.88@o2ib
    12345-0@lo
    12345-172.21.52.88@o2ib
    12345-172.21.48.250@tcp


    this is the mds lnet.conf

    ip2nets:
     - net-spec: tcp
       interfaces:
          0: eth0

    then I did as you suggested on the MDS:

    lctl set_param debug=+"net neterror"

    lctl ping 172.21.48.250
    12345-0@lo
    12345-172.21.52.88@o2ib
    12345-172.21.48.250@tcp


    LOG file:

    
00000800:00000200:3.0F:1551827094.376143:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a312100] -> 12345-172.21.49.100@tcp (4)
    
00000800:00000200:3.0:1551827094.376155:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.100@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.100:1021
    <http://172.21.49.100:1021>
    
00000800:00000200:3.0:1551827094.376158:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d800 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827094.376312:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827097.376102:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f346f9400] -> 12345-172.21.49.110@tcp (4)
    
00000800:00000200:3.0:1551827097.376110:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.110@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.110:1021
    <http://172.21.49.110:1021>
    
00000800:00000200:3.0:1551827097.376114:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d800 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827097.376135:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3d32d000] -> 12345-172.21.48.69@tcp (4)
    
00000800:00000200:3.0:1551827097.376140:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.69@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.69:1021
    <http://172.21.48.69:1021>
    
00000800:00000200:3.0:1551827097.376143:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312b00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827097.376199:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3d32c100] -> 12345-172.21.48.207@tcp (4)
    
00000800:00000200:3.0:1551827097.376206:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.207@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.207:1021
    <http://172.21.48.207:1021>
    
00000800:00000200:3.0:1551827097.376208:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32cc00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827097.376322:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827097.376339:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827097.376351:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827098.376136:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f49a16c00] -> 12345-172.21.50.66@tcp (4)
    
00000800:00000200:3.0:1551827098.376142:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.50.66@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.50.66:1021
    <http://172.21.50.66:1021>
    
00000800:00000200:3.0:1551827098.376145:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32cc00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827098.376302:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827099.376211:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a312700] -> 12345-172.21.50.25@tcp (4)
    
00000800:00000200:3.0:1551827099.376217:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.50.25@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.50.25:1021
    <http://172.21.50.25:1021>
    
00000800:00000200:3.0:1551827099.376220:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32cc00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827099.376350:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827102.376181:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3d32dc00] -> 12345-172.21.48.230@tcp (4)
    
00000800:00000200:3.0:1551827102.376189:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.230@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.230:1021
    <http://172.21.48.230:1021>
    
00000800:00000200:3.0:1551827102.376192:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32cc00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827102.376377:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827103.376099:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a313400] -> 12345-172.21.48.151@tcp (4)
    
00000800:00000200:3.0:1551827103.376105:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.151@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.151:1021
    <http://172.21.48.151:1021>
    
00000800:00000200:3.0:1551827103.376108:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32cc00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827103.376166:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3d32d600] -> 12345-172.21.48.208@tcp (4)
    
00000800:00000200:3.0:1551827103.376169:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.208@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.208:1021
    <http://172.21.48.208:1021>
    
00000800:00000200:3.0:1551827103.376170:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312b00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827103.376176:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a313a00] -> 12345-172.21.48.210@tcp (4)
    
00000800:00000200:3.0:1551827103.376178:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.210@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.210:1021
    <http://172.21.48.210:1021>
    
00000800:00000200:3.0:1551827103.376179:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d800 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827103.376187:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3d32c700] -> 12345-172.21.48.12@tcp (4)
    
00000800:00000200:3.0:1551827103.376189:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.12@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.12:1021
    <http://172.21.48.12:1021>
    
00000800:00000200:3.0:1551827103.376190:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32dd00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827103.376288:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827103.376315:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827103.376330:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827103.376345:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827104.376194:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f346f9600] -> 12345-172.21.49.171@tcp (4)
    
00000800:00000200:3.0:1551827104.376201:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.171@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.171:1021
    <http://172.21.49.171:1021>
    
00000800:00000200:3.0:1551827104.376208:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32dd00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827104.376243:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0b75c59c00] -> 12345-172.21.48.150@tcp (4)
    
00000800:00000200:3.0:1551827104.376247:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.150@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.150:1021
    <http://172.21.48.150:1021>
    
00000800:00000200:3.0:1551827104.376249:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d800 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827104.376265:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3d32da00] -> 12345-172.21.48.60@tcp (4)
    
00000800:00000200:3.0:1551827104.376270:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.60@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.60:1021
    <http://172.21.48.60:1021>
    
00000800:00000200:3.0:1551827104.376272:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312b00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827104.376283:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3d32d200] -> 12345-172.21.48.68@tcp (4)
    
00000800:00000200:3.0:1551827104.376287:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.68@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.68:1021
    <http://172.21.48.68:1021>
    
00000800:00000200:3.0:1551827104.376290:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32cc00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827104.376324:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0b75c59a00] -> 12345-172.21.49.25@tcp (4)
    
00000800:00000200:3.0:1551827104.376327:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.25@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.25:1021
    <http://172.21.49.25:1021>
    
00000800:00000200:3.0:1551827104.376329:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827104.376347:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f49a17200] -> 12345-172.21.49.34@tcp (4)
    
00000800:00000200:3.0:1551827104.376357:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.34@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.34:1021
    <http://172.21.49.34:1021>
    
00000800:00000200:3.0:1551827104.376358:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827104.376365:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3d32df00] -> 12345-172.21.49.37@tcp (4)
    
00000800:00000200:3.0:1551827104.376372:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.37@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.37:1021
    <http://172.21.49.37:1021>
    
00000800:00000200:3.0:1551827104.376374:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312300 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827104.376459:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827104.376483:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827104.376503:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827104.376522:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827104.376542:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827104.376599:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827104.376620:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827105.376168:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a312d00] -> 12345-172.21.48.225@tcp (4)
    
00000800:00000200:3.0:1551827105.376174:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.225@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.225:1021
    <http://172.21.48.225:1021>
    
00000800:00000200:3.0:1551827105.376177:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312300 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827105.376210:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a313b00] -> 12345-172.21.49.105@tcp (4)
    
00000800:00000200:3.0:1551827105.376213:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.105@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.105:1021
    <http://172.21.49.105:1021>
    
00000800:00000200:3.0:1551827105.376214:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827105.376289:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827105.376322:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827106.376096:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a312400] -> 12345-172.21.50.22@tcp (4)
    
00000800:00000200:3.0:1551827106.376102:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.50.22@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.50.22:1021
    <http://172.21.50.22:1021>
    
00000800:00000200:3.0:1551827106.376105:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827106.376180:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827107.376127:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a313300] -> 12345-172.21.49.33@tcp (4)
    
00000800:00000200:3.0:1551827107.376135:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.33@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.33:1021
    <http://172.21.49.33:1021>
    
00000800:00000200:3.0:1551827107.376139:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827107.376167:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a312c00] -> 12345-172.21.32.60@tcp (4)
    
00000800:00000200:3.0:1551827107.376171:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.32.60@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.32.60:1021
    <http://172.21.32.60:1021>
    
00000800:00000200:3.0:1551827107.376173:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312300 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827107.376317:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827107.376347:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000400:00000200:3.0:1551827107.898298:0:17198:0:(lib-move.c:4114:lnet_parse())
    TRACE: 172.21.49.233@tcp(172.21.49.233@tcp) <- 172.21.49.100@tcp :
    PUT - for me
    
00000400:00000200:3.0:1551827107.898314:0:17198:0:(lib-ptl.c:571:lnet_ptl_match_md())
    Request from 12345-172.21.49.100@tcp of length 520 into portal 26
    MB=0x5c3e6495f2d10
    
00000400:00000100:3.0:1551827107.898320:0:17198:0:(lib-move.c:3753:lnet_parse_put())
    Dropping PUT from 12345-172.21.49.100@tcp portal 26 match
    1622768724421904 offset 0 length 520: 4
    
00000400:00000200:3.0:1551827107.898323:0:17198:0:(lib-msg.c:816:lnet_is_health_check())
    health check = 0, status = -2, hstatus = 0
    
00000400:00000200:3.0:1551827107.898336:0:17198:0:(lib-move.c:4114:lnet_parse())
    TRACE: 172.21.49.233@tcp(172.21.49.233@tcp) <- 172.21.49.100@tcp :
    PUT - for me
    
00000400:00000200:3.0:1551827107.898340:0:17198:0:(lib-ptl.c:571:lnet_ptl_match_md())
    Request from 12345-172.21.49.100@tcp of length 520 into portal 12
    MB=0x5c3e6495f2d50
    
00000400:00000100:3.0:1551827107.898342:0:17198:0:(lib-move.c:3753:lnet_parse_put())
    Dropping PUT from 12345-172.21.49.100@tcp portal 12 match
    1622768724421968 offset 0 length 520: 4
    
00000400:00000200:3.0:1551827107.898344:0:17198:0:(lib-msg.c:816:lnet_is_health_check())
    health check = 0, status = -2, hstatus = 0
    
00000800:00000200:3.0:1551827108.376065:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a312800] -> 12345-172.21.48.236@tcp (4)
    
00000800:00000200:3.0:1551827108.376072:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.236@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.236:1021
    <http://172.21.48.236:1021>
    
00000800:00000200:3.0:1551827108.376075:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312300 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827108.376124:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a313f00] -> 12345-172.21.49.31@tcp (4)
    
00000800:00000200:3.0:1551827108.376127:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.49.31@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.49.31:1021
    <http://172.21.49.31:1021>
    
00000800:00000200:3.0:1551827108.376128:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827108.376230:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827108.376251:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827109.376098:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f3a313c00] -> 12345-172.21.48.219@tcp (4)
    
00000800:00000200:3.0:1551827109.376106:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.219@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.219:1021
    <http://172.21.48.219:1021>
    
00000800:00000200:3.0:1551827109.376109:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827109.376283:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000400:00000080:3.0:1551827109.465749:0:17286:0:(module.c:90:libcfs_ioctl())
    libcfs ioctl cmd 3221775678
    
00000400:00000200:3.0:1551827109.465771:0:17286:0:(lib-move.c:4783:LNetGet())
    LNetGet -> 12345-172.21.48.250@tcp
    
00000400:00000200:3.0:1551827109.465780:0:17286:0:(lib-move.c:2450:lnet_handle_send_case_locked())
    Source ANY to NMR:  172.21.48.250@tcp local destination
    
00000400:00000200:3.0:1551827109.465785:0:17286:0:(lib-move.c:1714:lnet_handle_send())
    rspt_next_hop_nid = 172.21.48.250@tcp
    
00000400:00000200:3.0:1551827109.465792:0:17286:0:(lib-move.c:1728:lnet_handle_send())
    TRACE: 172.21.49.233@tcp(172.21.49.233@tcp:<?>) ->
    172.21.48.250@tcp(172.21.48.250@tcp:172.21.48.250@tcp
    <mailto:172.21.48.250@tcp:172.21.48.250@tcp>) : GET try# 0
    
00000800:00000200:3.0:1551827109.465798:0:17286:0:(socklnd_cb.c:996:ksocknal_send())
    sending 0 bytes in 0 frags to 12345-172.21.48.250@tcp
    
00000800:00000200:3.0:1551827109.465804:0:17286:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f49a9be00] -> 12345-172.21.48.250@tcp (5)
    
00000800:00000200:3.0:1551827109.465812:0:17286:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.250@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.250:988
    <http://172.21.48.250:988>
    
00000800:00000200:3.0:1551827109.465815:0:17286:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f307a7300 type 2, nob 96 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827109.465874:0:17198:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000400:00000200:3.0:1551827109.465883:0:17198:0:(lib-msg.c:816:lnet_is_health_check())
    health check = 1, status = 0, hstatus = 0
    
00000400:00000200:3.0:1551827109.465887:0:17198:0:(lib-msg.c:630:lnet_health_check())
    health check: 172.21.49.233@tcp->172.21.48.250@tcp: GET: OK
    
00000400:00000200:3.0:1551827109.465896:0:17286:0:(api-ni.c:4045:lnet_ping())
    poll 1(5 0)
    
00000400:00000200:3.0:1551827109.466442:0:17198:0:(lib-move.c:4114:lnet_parse())
    TRACE: 172.21.49.233@tcp(172.21.49.233@tcp) <- 172.21.48.250@tcp :
    REPLY - for me
    
00000400:00000200:3.0:1551827109.466449:0:17198:0:(lib-move.c:3881:lnet_parse_reply())
    172.21.49.233@tcp: Reply from 12345-172.21.48.250@tcp of length
    64/64 into md 0x1d
    
00000400:00000200:3.0:1551827109.466459:0:17198:0:(lib-md.c:69:lnet_md_unlink())
    Unlinking md ffff8e0f37258088
    
00000400:00000200:3.0:1551827109.466460:0:17198:0:(lib-msg.c:816:lnet_is_health_check())
    health check = 0, status = 0, hstatus = 0
    
00000400:00000200:3.0:1551827109.466466:0:17286:0:(api-ni.c:4045:lnet_ping())
    poll 1(3 0) unlinked
    
00000800:00000200:3.0:1551827110.376074:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f37728500] -> 12345-172.21.48.203@tcp (4)
    
00000800:00000200:3.0:1551827110.376084:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.203@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.203:1021
    <http://172.21.48.203:1021>
    
00000800:00000200:3.0:1551827110.376089:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32d900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827110.376124:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f37728f00] -> 12345-172.21.48.11@tcp (4)
    
00000800:00000200:3.0:1551827110.376127:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.11@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.11:1021
    <http://172.21.48.11:1021>
    
00000800:00000200:3.0:1551827110.376129:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312300 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827110.376141:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f37729f00] -> 12345-172.21.48.223@tcp (4)
    
00000800:00000200:3.0:1551827110.376143:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.223@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.223:1021
    <http://172.21.48.223:1021>
    
00000800:00000200:3.0:1551827110.376145:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827110.376207:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f37729900] -> 12345-172.21.48.71@tcp (4)
    
00000800:00000200:3.0:1551827110.376211:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.71@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.71:1021
    <http://172.21.48.71:1021>
    
00000800:00000200:3.0:1551827110.376212:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32cc00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827110.376279:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827110.376290:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827110.376303:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827110.376322:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000400:00000200:3.0:1551827110.730073:0:17198:0:(lib-move.c:4114:lnet_parse())
    TRACE: 172.21.49.233@tcp(172.21.49.233@tcp) <- 172.21.48.69@tcp :
    PUT - for me
    
00000400:00000200:3.0:1551827110.730085:0:17198:0:(lib-ptl.c:571:lnet_ptl_match_md())
    Request from 12345-172.21.48.69@tcp of length 520 into portal 26
    MB=0x5c4c25c755750
    
00000400:00000100:3.0:1551827110.730091:0:17198:0:(lib-move.c:3753:lnet_parse_put())
    Dropping PUT from 12345-172.21.48.69@tcp portal 26 match
    1623713937446736 offset 0 length 520: 4
    
00000400:00000200:3.0:1551827110.730093:0:17198:0:(lib-msg.c:816:lnet_is_health_check())
    health check = 0, status = -2, hstatus = 0
    
00000400:00000200:3.0:1551827110.730103:0:17198:0:(lib-move.c:4114:lnet_parse())
    TRACE: 172.21.49.233@tcp(172.21.49.233@tcp) <- 172.21.48.69@tcp :
    PUT - for me
    
00000400:00000200:3.0:1551827110.730106:0:17198:0:(lib-ptl.c:571:lnet_ptl_match_md())
    Request from 12345-172.21.48.69@tcp of length 520 into portal 12
    MB=0x5c4c25c755760
    
00000400:00000100:3.0:1551827110.730113:0:17198:0:(lib-move.c:3753:lnet_parse_put())
    Dropping PUT from 12345-172.21.48.69@tcp portal 12 match
    1623713937446752 offset 0 length 520: 4
    
00000400:00000200:3.0:1551827110.730115:0:17198:0:(lib-msg.c:816:lnet_is_health_check())
    health check = 0, status = -2, hstatus = 0
    
00000800:00000200:3.0:1551827111.376066:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f37728c00] -> 12345-172.21.48.72@tcp (4)
    
00000800:00000200:3.0:1551827111.376072:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.72@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.72:1021
    <http://172.21.48.72:1021>
    
00000800:00000200:3.0:1551827111.376075:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3d32cc00 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827111.376119:0:17197:0:(socklnd.c:195:ksocknal_find_peer_locked())
    got peer_ni [ffff8e0f37729600] -> 12345-172.21.48.204@tcp (4)
    
00000800:00000200:3.0:1551827111.376121:0:17197:0:(socklnd_cb.c:757:ksocknal_queue_tx_locked())
    Sending to 12345-172.21.48.204@tcp ip *MailScanner warning:
    numerical links are often malicious:* 172.21.48.204:1021
    <http://172.21.48.204:1021>
    
00000800:00000200:3.0:1551827111.376122:0:17197:0:(socklnd_cb.c:776:ksocknal_queue_tx_locked())
    Packet ffff8e0f3a312900 type 192, nob 24 niov 1 nkiov 0
    
00000800:00000200:3.0:1551827111.376218:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    
00000800:00000200:3.0:1551827111.376230:0:17200:0:(socklnd_cb.c:549:ksocknal_process_transmit())
    send(0) 0
    Debug log: 162 lines, 162 kept, 0 dropped, 0 bad.

    On 3/5/19 2:15 PM, Amir Shehata wrote:
    Take a look at this: https://jira.whamcloud.com/browse/LU-11840
    Let me know if this is the same issue you're seeing.

    On Tue, 5 Mar 2019 at 14:04, Amir Shehata
    <[email protected]
    <mailto:[email protected]>> wrote:

        Hi Riccardo,

        It's not LNet Health. It's Dynamic Discovery. What's
        happening is that 2.12 is discovering all the interfaces on
        the peer. That's why you see all the interfaces in the peer show.

        Multi-Rail doesn't enable o2ib. It just sees it. If the node
        doing the discovery has only tcp, then it should never try to
        connect over the o2ib.

        Are you able to do a "lnetctl ping 172.21.48.250@tcp" from
        the MDS multiple times? Do you see the ping failing
        intermittently?

        What should happen is that when the MDS (running 2.12) tries
        to talk to the peer you have identified, then it'll discover
        its interfaces. But then should realize that it can only
        reach it on the tcp network, since that's the only network
        configured on the MDS.

        It might help, if you just configure LNet only, on the MDS
        and the peer and run a simple
        lctl set_param debug=+"net neterror"
        lnetctl ping <>
        lctl dk >log

        If you can share the debug output, it'll help to pinpoint the
        problem.

        thanks
        amir

        On Tue, 5 Mar 2019 at 12:30, Riccardo Veraldi
        <[email protected]
        <mailto:[email protected]>> wrote:

            I think I figured out the problem.
            My problem is related to Lnet Network Health feature:
            https://jira.whamcloud.com/browse/LU-9120
            the lustre MDS and the lsutre client having same version
            2.12.0
            negotiate a Multi-rail peer connection while this does
            not happen with
            the other clients (2.10.5). So what happens is that both
            IB and tcp are
            being used during transfers.
            tcp is only for connecting to the MDS, IB only to connect
            to the OSS
            anyway Multi-rail is enabled by default between the
            MDS,OSS and client.
            This messes up the situation. the MDS has only one TCP
            interface and
            cannot communicate by IB but in the "lnetctl peer show" a
            NID @o2ib
            shows up and it should not. At this point the MDS tries
            to connect to
            the client using IB and it will never work because there
            is no IB on the
            MDS.
            MDS Lnet configuration:

            net:
                 - net type: lo
                   local NI(s):
                     - nid: 0@lo
                       status: up
                 - net type: tcp
                   local NI(s):
                     - nid: 172.21.49.233@tcp
                       status: up
                       interfaces:
                           0: eth0

            but if I look at lnetctl peer show I See

                - primary nid: 172.21.52.88@o2ib
                   Multi-Rail: True
                   peer ni:
                     - nid: 172.21.48.250@tcp
                       state: NA
                     - nid: 172.21.52.88@o2ib
                       state: NA
                     - nid: 172.21.48.250@tcp1
                       state: NA
                     - nid: 172.21.48.250@tcp2
                       state: NA

            there should be no o2ib nid but Multi-rail for some
            reason enables it.
            I do not have problems with the other clients (non 2.12.0)

            How can I disable Multi-rail on 2.12.0 ??

            thank you



            On 3/5/19 12:14 PM, Patrick Farrell wrote:
            > Riccardo,
            >
            > Since 2.12 is still a relatively new maintenance
            release, it would be helpful if you could open an LU and
            provide more detail there - Such as what clients were
            doing, if you were using any new features (like DoM or
            FLR), and full dmesg from the clients and servers
            involved in these evictions.
            >
            > - Patrick
            >
            > On 3/5/19, 11:50 AM, "lustre-discuss on behalf of
            Riccardo Veraldi"
            <[email protected]
            <mailto:[email protected]> on
            behalf of [email protected]
            <mailto:[email protected]>> wrote:
            >
            >      Hello,
            >
            >      I have quite a big issue on my Lustre 2.12.0 MDS/MDT.
            >
            >      Clients moving data to the OSS occur into a
            locking problem I never met
            >      before.
            >
            >      The clients are mostly 2.10.5 except for one which
            is 2.12.0 but
            >      regardless the client version the problem is still
            there.
            >
            >      So these are the errors I see on hte MDS/MDT. When
            this happens
            >      everything just hangs. If I reboot the MDS
            everything is back to
            >      normality but it happened already 2 times in 3
            days and it is disrupting.
            >
            >      Any hints ?
            >
            >      Is it feasible to downgrade from 2.12.0 to 2.10.6 ?
            >
            >      thanks
            >
            >      Mar  5 11:10:33 psmdsana1501 kernel: Lustre:
            > 7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@
            Request sent has
            >      failed due to network error: [sent 1551813033/real
            1551813033]
            >      req@ffff9fdcbecd0300 x1626845000210688/t0(0)
            >      o104->[email protected].
            <mailto:[email protected].>87@o2ib:15/16 lens
            296/224 e 0 to 1 dl
            >      1551813044 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
            >      Mar  5 11:10:33 psmdsana1501 kernel: Lustre:
            > 7898:0:(client.c:2132:ptlrpc_expire_one_request())
            Skipped 50552576
            >      previous similar messages
            >      Mar  5 11:13:03 psmdsana1501 kernel: LustreError:
            > 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ###
            client (nid
            >      172.21.52.87@o2ib) failed to reply to blocking AST
            (req@ffff9fdcbecd0300
            >      x1626845000210688 status 0 rc -110), evict it ns:
            mdt-ana15-MDT0000_UUID
            >      lock: ffff9fde9b6873c0/0x9824623d2148ef38 lrc:
            4/0,0 mode: PR/PR res:
            >      [0x2000013a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc: 5
            type: IBT flags:
            >      0x60200400000020 nid: 172.21.52.87@o2ib remote:
            0xd8efecd6e7621e63
            >      expref: 8 pid: 7898 timeout: 333081 lvb_type: 0
            >      Mar  5 11:13:03 psmdsana1501 kernel: LustreError:
            138-a: ana15-MDT0000:
            >      A client on nid 172.21.52.87@o2ib was evicted due
            to a lock blocking
            >      callback time out: rc -110
            >      Mar  5 11:13:03 psmdsana1501 kernel: LustreError:
            > 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock
            callback timer
            >      expired after 150s: evicting client at
            172.21.52.87@o2ib ns:
            >      mdt-ana15-MDT0000_UUID lock:
            ffff9fde9b6873c0/0x9824623d2148ef38 lrc:
            >      3/0,0 mode: PR/PR res:
            [0x2000013a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc:
            >      5 type: IBT flags: 0x60200400000020 nid:
            172.21.52.87@o2ib remote:
            >      0xd8efecd6e7621e63 expref: 9 pid: 7898 timeout: 0
            lvb_type: 0
            >      Mar  5 11:13:04 psmdsana1501 kernel: Lustre:
            ana15-MDT0000: Connection
            >      restored to 59c5a826-f4e9-0dd0-8d4f-08c204f25941
            (at 172.21.52.87@o2ib)
            >      Mar  5 11:15:34 psmdsana1501 kernel: LustreError:
            > 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ###
            client (nid
            >      172.21.52.142@o2ib) failed to reply to blocking AST
            >      (req@ffff9fde2d393600 x1626845000213776 status 0
            rc -110), evict it ns:
            >      mdt-ana15-MDT0000_UUID lock:
            ffff9fde9b6858c0/0x9824623d2148efee lrc:
            >      4/0,0 mode: PR/PR res: [0x2000013ac:0x1:0x0].0x0
            bits 0x13/0x0 rrc: 3
            >      type: IBT flags: 0x60200400000020 nid:
            172.21.52.142@o2ib remote:
            >      0xbb35541ea6663082 expref: 9 pid: 7898 timeout:
            333232 lvb_type: 0
            >      Mar  5 11:15:34 psmdsana1501 kernel: LustreError:
            138-a: ana15-MDT0000:
            >      A client on nid 172.21.52.142@o2ib was evicted due
            to a lock blocking
            >      callback time out: rc -110
            >      Mar  5 11:15:34 psmdsana1501 kernel: LustreError:
            > 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock
            callback timer
            >      expired after 151s: evicting client at
            172.21.52.142@o2ib ns:
            >      mdt-ana15-MDT0000_UUID lock:
            ffff9fde9b6858c0/0x9824623d2148efee lrc:
            >      3/0,0 mode: PR/PR res: [0x2000013ac:0x1:0x0].0x0
            bits 0x13/0x0 rrc: 3
            >      type: IBT flags: 0x60200400000020 nid:
            172.21.52.142@o2ib remote:
            >      0xbb35541ea6663082 expref: 10 pid: 7898 timeout: 0
            lvb_type: 0
            >      Mar  5 11:15:34 psmdsana1501 kernel: Lustre:
            ana15-MDT0000: Connection
            >      restored to 9d49a115-646b-c006-fd85-000a4b90019a
            (at 172.21.52.142@o2ib)
            >      Mar  5 11:20:33 psmdsana1501 kernel: Lustre:
            > 7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@
            Request sent has
            >      failed due to network error: [sent 1551813633/real
            1551813633]
            >      req@ffff9fdcc2a95100 x1626845000222624/t0(0)
            >      o104->[email protected].
            <mailto:[email protected].>87@o2ib:15/16 lens
            296/224 e 0 to 1 dl
            >      1551813644 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
            >      Mar  5 11:20:33 psmdsana1501 kernel: Lustre:
            > 7898:0:(client.c:2132:ptlrpc_expire_one_request())
            Skipped 23570550
            >      previous similar messages
            >      Mar  5 11:22:46 psmdsana1501 kernel: LustreError:
            > 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ###
            client (nid
            >      172.21.52.87@o2ib) failed to reply to blocking AST
            (req@ffff9fdcc2a95100
            >      x1626845000222624 status 0 rc -110), evict it ns:
            mdt-ana15-MDT0000_UUID
            >      lock: ffff9fde86ffdf80/0x9824623d2148f23a lrc:
            4/0,0 mode: PR/PR res:
            >      [0x2000013ae:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3
            type: IBT flags:
            >      0x60200400000020 nid: 172.21.52.87@o2ib remote:
            0xd8efecd6e7621eb7
            >      expref: 9 pid: 7898 timeout: 333665 lvb_type: 0
            >      Mar  5 11:22:46 psmdsana1501 kernel: LustreError:
            138-a: ana15-MDT0000:
            >      A client on nid 172.21.52.87@o2ib was evicted due
            to a lock blocking
            >      callback time out: rc -110
            >      Mar  5 11:22:46 psmdsana1501 kernel: LustreError:
            > 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock
            callback timer
            >      expired after 150s: evicting client at
            172.21.52.87@o2ib ns:
            >      mdt-ana15-MDT0000_UUID lock:
            ffff9fde86ffdf80/0x9824623d2148f23a lrc:
            >      3/0,0 mode: PR/PR res: [0x2000013ae:0x1:0x0].0x0
            bits 0x13/0x0 rrc: 3
            >      type: IBT flags: 0x60200400000020 nid:
            172.21.52.87@o2ib remote:
            >      0xd8efecd6e7621eb7 expref: 10 pid: 7898 timeout: 0
            lvb_type: 0
            >
            >
            > _______________________________________________
            >      lustre-discuss mailing list
            > [email protected]
            <mailto:[email protected]>
            >
            http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
            >
            >
            > _______________________________________________
            > lustre-discuss mailing list
            > [email protected]
            <mailto:[email protected]>
            >
            http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


            _______________________________________________
            lustre-discuss mailing list
            [email protected]
            <mailto:[email protected]>
            http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Reply via email to