Hi,

We are having some LNet issues and wonder if anyone can advise.

The client is running Lustre 2.15.5; the servers are running 2.12.6.

Fabric is IB.

The file system mounts, but OSTs on a couple of OSSs are not contactable.

Client and servers can ping each other over the IB network.

However, an lnetctl ping between the bad OSSs and this client fails in both directions. Between those OSSs and other clients it works fine.

That is, most clients are working well; only one or two are affected.

Server to client:
lnetctl ping 172.18.178.201@o2ib
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.178.201@o2ib: Input/output error

Client to server:
manage:
    - ping:
          errno: -1
          descr: failed to ping 172.18.185.10@o2ib: Input/output error



The o2ib network is also reported as down:
lnetctl net show --net o2ib --verbose
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.18.178.216@o2ib
          status: down
          interfaces:
              0: ibs1f0
          statistics:
              send_count: 45032
              recv_count: 45030
              drop_count: 0
          tunables:
              peer_timeout: 100
              peer_credits: 32
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
              peercredits_hiw: 16
              map_on_demand: 1
              concurrent_sends: 32
              fmr_pool_size: 512
              fmr_flush_trigger: 384
              fmr_cache: 1
              ntx: 512
              conns_per_peer: 1
          dev cpt: 0
          CPT: "[0,1]"
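
To run the same check across several nodes, here is a minimal sketch (Python, standard library only) that scans saved `lnetctl net show` output and lists any local NI whose status is not `up`. The function name `down_nids` and the simple line-based parsing are assumptions for illustration; this is not part of lnetctl, and it relies on `status:` following `nid:` as in the output above.

```python
import re


def down_nids(net_show_text):
    """Return the NIDs whose 'status' is anything other than 'up'.

    Assumes each '- nid:' line is followed by its own 'status:' line,
    as in the `lnetctl net show` output pasted above.
    """
    down = []
    current_nid = None
    for line in net_show_text.splitlines():
        nid_match = re.match(r"\s*-\s*nid:\s*(\S+)", line)
        if nid_match:
            current_nid = nid_match.group(1)
            continue
        status_match = re.match(r"\s*status:\s*(\S+)", line)
        if status_match and current_nid is not None:
            if status_match.group(1) != "up":
                down.append(current_nid)
            current_nid = None  # only the first status after a nid belongs to it
    return down


# Trimmed copy of the output shown above, used as sample input.
SAMPLE = """\
net:
    - net type: o2ib
      local NI(s):
        - nid: 172.18.178.216@o2ib
          status: down
          interfaces:
              0: ibs1f0
"""

if __name__ == "__main__":
    for nid in down_nids(SAMPLE):
        print("NI not up:", nid)
```

In practice the text would come from `lnetctl net show --net o2ib --verbose` captured on each node rather than from the embedded sample.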



Could this be a hardware error, even though the IB is working?

Could it be related to https://jira.whamcloud.com/browse/LU-16378 ?

Are there any suggestions on how to bring up the lnet network or fix the problems?

Thanks,
Alastair.
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
