Hi Folks,

One of our filesystems seemed to fail over the holiday weekend - we're running DNE and MDT0001 won't mount. At first it looked like we'd run out of space (rc = -28) but then we were seeing this:

mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists retries left: 0
mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists

possibly

kernel: LustreError: 13921:0:(genops.c:478:class_register_device()) astrofs-OST0000-osc-MDT0001: already exists, won't add

lustre_rmmod wouldn't remove everything cleanly (osc in use), so after a reboot everything *seemed* to start OK

[root@astrofs-mds1 ~]# mount -t lustre
/dev/mapper/MGS on /lustre/MGS type lustre (ro)
/dev/mapper/MDT0000 on /lustre/astrofs-MDT0000 type lustre (ro)
/dev/mapper/MDT0001 on /lustre/astrofs-MDT0001 type lustre (ro)

... but not for long

kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

possibly corrupt llog?

I see LU-12674, which looks like our problem, but the fix was only backported to the 2.12 branch (these servers are still 2.10.8).

Piecing together what *might* have happened: a user possibly ran out of inodes and then did an rm -r before the system stopped responding.

Mounting just now I'm getting:

[ 1985.078422] LustreError: 10953:0:(llog.c:654:llog_process_thread()) astrofs-OST0001-osc-MDT0001: Local llog found corrupted #0x7ede0:1:0 plain index 35518 count 2
[ 1985.095129] LustreError: 10959:0:(llog_osd.c:961:llog_osd_next_block()) astrofs-MDT0001-osd: invalid llog tail at log id [0x7ef40:0x1:0x0]:0 offset 577536 bytes 4096
[ 1985.109892] LustreError: 10959:0:(osp_sync.c:1242:osp_sync_thread()) astrofs-OST0004-osc-MDT0001: llog process with osp_sync_process_queues failed: -22
[ 1985.126797] LustreError: 10973:0:(llog_cat.c:269:llog_cat_id2handle()) astrofs-OST000b-osc-MDT0001: error opening log id [0x7ef76:0x1:0x0]:0: rc = -2
[ 1985.140169] LustreError: 10973:0:(llog_cat.c:823:llog_cat_process_cb()) astrofs-OST000b-osc-MDT0001: cannot find handle for llog [0x7ef76:0x1:0x0]: rc = -2
[ 1985.155321] Lustre: astrofs-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 1985.169404] Lustre: astrofs-MDT0001: in recovery but waiting for the first client to connect
[ 1985.177869] Lustre: astrofs-MDT0001: Will be in recovery for at least 2:30, or until 1508 clients reconnect
[ 1985.187612] Lustre: astrofs-MDT0001: Connection restored to a5e41149-73fc-b60a-30b1-da096a5c2527 (at 1170@gni1)
[ 2017.251374] Lustre: astrofs-MDT0001: Connection restored to 7a388f58-bc16-6bd7-e0c8-4ffa7c0dd305 (at 400@gni1)
[ 2017.261374] Lustre: Skipped 1275 previous similar messages
[ 2081.458117] Lustre: astrofs-MDT0001: Connection restored to 10.10.36.143@o2ib4 (at 10.10.36.143@o2ib4)
[ 2081.467419] Lustre: Skipped 277 previous similar messages
[ 2082.324547] Lustre: astrofs-MDT0001: Recovery over after 1:37, of 1508 clients 1508 recovered and 0 were evicted.

Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:

Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG

[ 2082.392381] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
[ 2082.401422] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
[ 2082.408558] Pid: 11082, comm: orph_cleanup_as 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon May 27 03:45:37 UTC 2019
[ 2082.418891] Call Trace:
[ 2082.421340] [<ffffffffc0af07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 2082.427890] [<ffffffffc0af087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 2082.434077] [<ffffffffc1694159>] osp_sync_declare_add+0x3a9/0x3e0 [osp]
[ 2082.440797] [<ffffffffc1683299>] osp_declare_destroy+0xc9/0x1c0 [osp]
[ 2082.447338] [<ffffffffc15e0c6e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
[ 2082.454237] [<ffffffffc15c54a5>] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
[ 2082.461213] [<ffffffffc15d0ac6>] lod_obj_for_each_stripe+0xb6/0x230 [lod]
[ 2082.468104] [<ffffffffc15d184b>] lod_declare_destroy+0x43b/0x5c0 [lod]
[ 2082.474736] [<ffffffffc1648896>] orph_key_test_and_del+0x5f6/0xd30 [mdd]
[ 2082.481538] [<ffffffffc1649587>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
[ 2082.488250] [<ffffffffa7cc1c31>] kthread+0xd1/0xe0
[ 2082.493147] [<ffffffffa8374c1d>] ret_from_fork_nospec_begin+0x7/0x21
[ 2082.499601] [<ffffffffffffffff>] 0xffffffffffffffff
[ 2082.504585] Kernel panic - not syncing: LBUG

e2fsck when mounted as ldiskfs seems to be clean, but is there a way I can get it mounted enough to run lfsck?

Alternatively, can I upgrade the MDSs to 2.12.x while having the OSSs still on 2.10?
Yes, I know this isn't ideal, but I wasn't planning a large upgrade at zero notice to our users (also, we still have a legacy system accessing it with a 2.7 client - its replacement arrived last Sept but still hasn't been handed over to us, so I really don't want to get too out of step).

Many thanks

Andrew
