Here at the University of Groningen we run a Lustre setup in which client nodes are being evicted by the metadata server:
Kernel: CentOS 7.5 3.10.0-862.2.3-lustre
Lustre: 2.10.4
Network: IB/10 Gb Ethernet

Logs on the client:

Dec 19 06:45:28 dh-node03 kernel: [1952901.506173] LustreError: 11-0: dh3-MDT0000-mdc-ffff9f337a935000: operation ldlm_enqueue to node 172.23.53.205@o2ib3 failed: rc = -107
Dec 19 06:45:28 dh-node03 kernel: [1952901.508610] Lustre: dh3-MDT0000-mdc-ffff9f337a935000: Connection to dh3-MDT0000 (at 172.23.53.205@o2ib3) was lost; in progress operations using this service will wait for recovery to complete
Dec 19 06:45:28 dh-node03 kernel: [1952901.559429] LustreError: 167-0: dh3-MDT0000-mdc-ffff9f337a935000: This client was evicted by dh3-MDT0000; in progress operations using this service will fail.
Dec 19 06:45:28 dh-node03 kernel: [1952901.559678] LustreError: 29373:0:(file.c:172:ll_close_inode_openhandle()) dh3-clilmv-ffff9f337a935000: inode [0x200009e9e:0xfabd:0x0] mdc close failed: rc = -5
Dec 19 06:45:28 dh-node03 kernel: [1952901.559681] LustreError: 29373:0:(file.c:172:ll_close_inode_openhandle()) Skipped 1 previous similar message
Dec 19 06:45:28 dh-node03 kernel: [1952901.594335] LustreError: 27096:0:(lmv_obd.c:1250:lmv_fid_alloc()) Can't alloc new fid, rc -19
Dec 19 06:45:28 dh-node03 kernel: [1952901.627102] LustreError: 29477:0:(file.c:3644:ll_inode_revalidate_fini()) dh3: revalidate FID [0x200009e9e:0xef54:0x0] error: rc = -108
Dec 19 06:45:29 dh-node03 kernel: [1952902.316568] LustreError: 29373:0:(file.c:172:ll_close_inode_openhandle()) dh3-clilmv-ffff9f337a935000: inode [0x200009e9e:0xfb45:0x0] mdc close failed: rc = -108
Dec 19 06:45:29 dh-node03 kernel: [1952902.318931] LustreError: 29373:0:(file.c:172:ll_close_inode_openhandle()) Skipped 7 previous similar messages

Logs on the metadata server:

Dec 19 06:45:28 dh3-mds01 kernel: LustreError: 3883:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid 172.23.53.3@o2ib3) failed to reply to blocking AST (req@ffff9887dc085100 x1606923837274448 status 0 rc -110), evict it ns: mdt-dh3-MDT0000_UUID lock: ffff988c19602800/0x7477c37a8da4a564 lrc: 4/0,0 mode: PR/PR res: [0x200009e9e:0xfb4c:0x0].0x0 bits 0x20 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.23.53.3@o2ib3 remote: 0xea19f3efd5f14578 expref: 5753603 pid: 44838 timeout: 17016180683 lvb_type: 0
Dec 19 06:45:28 dh3-mds01 kernel: LustreError: 138-a: dh3-MDT0000: A client on nid 172.23.53.3@o2ib3 was evicted due to a lock blocking callback time out: rc -110
Dec 19 06:45:28 dh3-mds01 kernel: Lustre: dh3-MDT0000: Connection restored to ad5824f9-f876-01c5-a14b-5a22ddabed41 (at 172.23.53.3@o2ib3)
Dec 19 06:48:49 dh3-mds01 kernel: LNet: Service thread pid 3883 was inactive for 200.66s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 19 06:48:49 dh3-mds01 kernel: LNet: 3187:0:(linux-debug.c:185:libcfs_call_trace()) can't show stack: kernel doesn't export show_task
Dec 19 06:48:49 dh3-mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1545198529.3883
Dec 19 06:50:09 dh3-mds01 kernel: LNet: Service thread pid 172356 was inactive for 200.15s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Dec 19 06:50:09 dh3-mds01 kernel: LNet: 3187:0:(linux-debug.c:185:libcfs_call_trace()) can't show stack: kernel doesn't export show_task
Dec 19 06:50:09 dh3-mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1545198609.172356
Dec 19 06:50:28 dh3-mds01 kernel: LustreError: 3883:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1545198328, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dh3-MDT0000_UUID lock: ffff988c3f913c00/0x7477c37a8da4e2c0 lrc: 3/0,1 mode: --/EX res: [0x200009e9e:0xfb4c:0x0].0x0 bits 0x21 rrc: 3 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 3883 timeout: 0 lvb_type: 0
Dec 19 06:50:28 dh3-mds01 kernel: LustreError: dumping log to /tmp/lustre-log.1545198628.3883
Dec 19 06:51:49 dh3-mds01 kernel: LustreError: 172356:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1545198409, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-dh3-MDT0000_UUID lock: ffff988c29834400/0x7477c37a8df47a1c lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2 rrc: 3 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 172356 timeout: 0 lvb_type: 0

Possibly we need to do some LNet tuning; currently we don't have any tuning set on the clients, the metadata server, or the OSSes. Any pointers, hints, or tips to point us in the right direction would be much appreciated!

--
Kind regards,

Ger Strikwerda
Chef Special
Rijksuniversiteit Groningen
Centrum voor Informatie Technologie
Team HPC Beheer

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276

"God is hard, God is fair
some men he gave brains, others he gave hair"
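
P.S. For anyone reading the logs above: the rc values are negative Linux errno codes, and the debug dump file names appear to follow the pattern lustre-log.<epoch seconds>.<pid> (this is inferred from the syslog timestamps and thread pids next to them, so treat it as an assumption). Below is a minimal Python sketch for decoding both. The helper names decode_rc and dump_time are made up for illustration and are not part of any Lustre tooling, and the errno table is hardcoded for Linux and only covers the codes seen in this post.

```python
import re
from datetime import datetime, timezone

# Linux errno values that occur in the logs of this post. Hardcoded on
# purpose so the result does not depend on the platform running the script.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    19: ("ENODEV", "No such device"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches "rc = -107", "rc -19" and "rc -110" but not "lrc:" or "rrc:".
RC_PATTERN = re.compile(r"\brc\s*=?\s*(-\d+)")

# Assumed dump file name layout: lustre-log.<epoch seconds>.<pid>
DUMP_PATTERN = re.compile(r"lustre-log\.(\d+)\.(\d+)")


def decode_rc(line):
    """Return (rc, errno name) pairs for every rc value found in a log line."""
    pairs = []
    for match in RC_PATTERN.finditer(line):
        rc = int(match.group(1))
        name, _description = LINUX_ERRNO.get(-rc, ("UNKNOWN", ""))
        pairs.append((rc, name))
    return pairs


def dump_time(path):
    """Return (UTC datetime, pid) parsed from a lustre-log.<epoch>.<pid> path."""
    match = DUMP_PATTERN.search(path)
    if match is None:
        raise ValueError(f"not a lustre-log dump path: {path!r}")
    epoch, pid = int(match.group(1)), int(match.group(2))
    return datetime.fromtimestamp(epoch, tz=timezone.utc), pid


if __name__ == "__main__":
    print(decode_rc("operation ldlm_enqueue to node 172.23.53.205@o2ib3 failed: rc = -107"))
    print(decode_rc("client (nid 172.23.53.3@o2ib3) failed to reply to blocking AST (... status 0 rc -110), evict it"))
    print(dump_time("/tmp/lustre-log.1545198529.3883"))
```

Read this way, the client sees ENOTCONN, EIO, ENODEV and ESHUTDOWN after the eviction, while the metadata server reports ETIMEDOUT on the blocking AST. The same epoch conversion applies to the "enqueued at 1545198328" value in the lock timeout message: that is 05:45:28 UTC, which would be 06:45:28 if the servers log in CET (an assumption), the same second as the eviction.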
