Hi everyone,

on a v2.11.0 installation there is an ongoing file transfer in Lustre, running
on one of the client nodes. Every now and then the process stops with I/O
errors and the file system stops responding. After a while access to the FS
resumes.

`lctl get_param *.*.state` shows a history of evictions of the targets:

> mdc.scratch-MDT0000-mdc-ffff8ecec4bf2800.state=
> current_state: FULL
> state_history:
>  - [ 1546444758, CONNECTING ]
>  - [ 1546444758, FULL ]
>  <...>
>  - [ 1546594365, DISCONN ]
>  - [ 1546594388, CONNECTING ]
>  - [ 1546594388, RECOVER ]
>  - [ 1546594388, FULL ]
>
> mgc.MGC10.149.0.183@o2ib.state=
> current_state: FULL
> state_history:
>  - [ 1546444756, CONNECTING ]
>  - [ 1546444756, FULL ]
>  <...>
>  - [ 1546594354, DISCONN ]
>  - [ 1546594363, CONNECTING ]
>  - [ 1546594363, EVICTED ]
>  - [ 1546594363, RECOVER ]
>  - [ 1546594363, FULL ]
>
> osc.scratch-OST0000-osc-ffff8ecec4bf2800.state=
> current_state: CONNECTING
> state_history:
>  - [ 1548316546, DISCONN ]
>  - [ 1548316551, CONNECTING ]
>  <...>
>  - [ 1548316851, CONNECTING ]
>  - [ 1548316921, DISCONN ]
>  - [ 1548316926, CONNECTING ]
>  - [ 1548316996, DISCONN ]
>  - [ 1548317001, CONNECTING ]
>  - [ 1548317071, DISCONN ]
>  - [ 1548317076, CONNECTING ]
>

and so on for all the OSTs affected by the file transfer.
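
For reference, this is roughly how I pull the histories and get a quick
per-target count of the reconnect cycles on the client (the "scratch" fs
name and the OST pattern below match this installation, adjust as needed):

    # dump the full import state history for every MDC/MGC/OSC on the client
    lctl get_param *.*.state

    # rough count of DISCONN transitions per OST import
    for p in $(lctl list_param osc.scratch-OST*.state); do
        printf '%s: ' "$p"
        lctl get_param -n "$p" | grep -c DISCONN
    done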

In the OSS logs all sorts of connectivity problems are reported, both by
Lustre:

> ods01: Jan 24 08:30:17 ods01 kernel: LustreError:
> 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from
> 10.149.255.238@o2ib arrived at 1548314992 with bad export cookie
> 13579970295878280357
> ods01: Jan 24 08:30:17 ods01 kernel: LustreError:
> 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) Skipped 919 previous
> similar messages
> ods01: Jan 24 08:32:25 ods01 kernel: LustreError:
> 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer
> expired after 150s: evicting client at 10.149.255.238@o2ib  ns:
> filter-scratch-OST0003_UUID lock: ffff888fa2924800/0xbc75bf08ca486964 lrc:
> 4/0,0 mode: PW/PW res: [0x19b6e36:0x0:0x0].0x0 rrc: 3 type: EXT
> [0->18446744073709551615] (req 716800->720895) flags: 0x60000400020020 nid:
> 10.149.255.238@o2ib remote: 0xe347df981d48a71c expref: 55885 pid: 38849
> timeout: 3689597 lvb_type: 0
> ods01: Jan 24 08:32:25 ods01 kernel: LustreError:
> 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) Skipped 6 previous similar
> messages
> ods01: Jan 24 08:33:15 ods01 kernel: LustreError: dumping log to
> /tmp/lustre-log.1548315195.186010
> ods01: Jan 24 08:44:41 ods01 kernel: Lustre:
> 186010:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has
> failed due to network error: [sent 1548314995/real 1548315881]
> req@ffff88975cc9d100 x1619663992901664/t0(0)
> o104->scratch-OST0002@10.149.255.238@o2ib:15/16 lens 296/224 e 0 to 1 dl
> 1548315006 ref 1 fl Rpc:eXSN/0/ffffffff rc -11/-1
>

and LNet:

> ods01: Jan 24 05:49:26 ods01 kernel: LNetError:
> 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Timed out tx:
> tx_queue, 1 seconds
> ods01: Jan 24 05:49:26 ods01 kernel: LNetError:
> 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Skipped 13 previous
> similar messages
> ods01: Jan 24 05:49:26 ods01 kernel: LNetError:
> 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Timed out RDMA with
> 10.149.255.254@o2ib (52): c:
> 0, oc: 0, rc: 63
> ods01: Jan 24 05:49:26 ods01 kernel: LNetError:
> 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Skipped 13 previous
> similar messages
>
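
When the o2iblnd timeouts show up I can also look at LNet health on the OSS
side; a minimal check, using the evicted client's NID from the Lustre
messages above, would be something like:

    # overall LNet counters (send/recv errors, drops) on the OSS
    lnetctl stats show

    # local o2ib network interface status
    lnetctl net show

    # is the client NID from the eviction message still reachable?
    lctl ping 10.149.255.238@o2ib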

From the client side:

> Jan 24 05:43:03 admin kernel: LustreError: 133-1:
> scratch-OST0001-osc-ffff8ecec4bf2800: BAD READ CHECKSUM: from
> 10.149.0.185@o2ib via 0@<0:0> inode [0x200002366:0x160f1:0x0] object
> 0x0:4243598 extent [0-28671], client 0, server ac9ffb9, cksum_type 4
> Jan 24 05:43:03 admin kernel: LustreError:
> 3867:0:(osc_request.c:1681:osc_brw_redo_request()) @@@ redo for recoverable
> error -11  req@ffff8ecb19cd0000 x1621600181397872/t0(0)
> o3->scratch-OST0001-osc-ffff8ecec4bf2800@10.149.0.185@o2ib:6/4 lens
> 608/408 e 1 to 1 dl 1548304253 ref 2 fl Interpret:ReXM/0/0 rc 26860/26860
>

and on another client:

> Jan 24 09:03:31 c055 kernel: Lustre:
> 16641:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has
> timed out for slow reply: [sent 1548316966/real 1548316966]
> req@ffff995a6dab4e00 x1623278997809600/t0(0)
> o101->scratch-OST0024-osc-ffff995226fca000@10.149.0.191@o2ib:28/4 lens
> 3584/400 e 0 to 1 dl 1548317011 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
> Jan 24 09:03:31 c055 kernel: Lustre:
> 16641:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 2324 previous
> similar messages
>
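
For the checksum error I can at least confirm how the client is configured;
a quick check (OST0001 is the target from the log above) would be roughly:

    # are wire checksums enabled on this client, and which algorithm is used?
    lctl get_param osc.scratch-OST0001*.checksums
    lctl get_param osc.scratch-OST0001*.checksum_type

    # current import state and connection details for the affected target
    lctl get_param osc.scratch-OST0001*.import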


`opareport` shows no interconnect errors anywhere in the fabric. The problem
occurs on different nodes in the cluster and when accessing different OSTs
attached to different OSSs.

Both Lustre servers and clients are running v2.11.0, ZFS v0.7.9.
Interconnect: Omni-Path.


This looks very much like a bug (in LNet?). What additional information
might help to pinpoint the root cause?
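
If it would help, I can enable extra debugging on a client and an OSS and
capture the debug log around one of the hangs; I was planning to do roughly
the following (the mask, buffer size and dump path are just my guesses at
what would be useful):

    # add net and RPC tracing to the debug mask
    lctl set_param debug=+net
    lctl set_param debug=+rpctrace

    # enlarge the debug buffer so the event of interest is not overwritten
    lctl set_param debug_mb=512

    # ...reproduce the hang, then dump the kernel debug log to a file
    lctl dk /tmp/lustre-debug.$(hostname).$(date +%s)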




Kind regards,
Konstantin Kurochkin
