Hi Colin,
Yes, I can confirm that we are back online. Thank you so much for the
tip. (For the archive, a condensed sketch of the sequence we used is at
the bottom of this message.)

Cheers,

sb.

Colin Faber <[email protected]> writes:

> Yeah, I don't know how successful lctl --device abort_recovery is going
> to be vs abort_recov on the device itself; I think probably by the time
> you get to aborting it via lctl it's already too late.
>
> But to confirm, you're back online again? (Also, time to upgrade!) =)
>
> On Tue, Nov 10, 2020 at 7:57 AM <[email protected]> wrote:
>
>> Hi Colin,
>>
>> Thank you. That was the tip I needed!
>>
>> We are running IEEL, so I did the following...
>>
>> * mount the MDT by hand with -o abort_recov
>>
>>     mount -v -t lustre -o abort_recov /dev/mapper/mpatha /mnt/lustre02-MDT0000
>>
>> * after it mounted up, umount it
>> * start the MDT via IEEL
>> * mount the file system on the clients.
>>
>> I also tried to start the MDT with IEEL and then use
>>
>>     lctl --device 4 abort_recovery
>>
>> but that didn't work.
>>
>> Cheers,
>>
>> sb.
>>
>> Scott Blomquist
>>
>> Colin Faber <[email protected]> writes:
>>
>> > Scott,
>> >
>> > Have you tried aborting recovery on mount?
>> >
>> > On Mon, Nov 9, 2020 at 1:15 PM <[email protected]> wrote:
>> >
>> >> Hi All,
>> >>
>> >> After the recent power glitch last week, one of our Lustre file
>> >> systems failed to come up.
>> >>
>> >> We diagnosed the problem down to a file system error on the MDT.
>> >> This is an old IEEL system running on Dell equipment.
>> >>
>> >> Here are the facts...
>> >>
>> >> * the RAID 6 array running on a Dell MD32xx is OK.
>> >>
>> >> * when we bring up the MDT it goes read-only, then the MDS host
>> >>   crashes
>> >>
>> >> * after this the MDT file system is dirty and we have to e2fsck it
>> >>
>> >> * I have tried multiple combinations of MDS up/down and OSS up/down
>> >>   with nothing changing the results.
>> >>
>> >> * this seems to be Lustre 2.7.15
>> >>
>> >> I think this may be
>> >>
>> >>     https://jira.whamcloud.com/browse/LU-7045
>> >>
>> >> or something like that.
>> >>
>> >> Is there a way to LFSCK (or something) this error away? Or is this
>> >> a "please update Lustre" error?
>> >>
>> >> Thanks for any help.
>> >>
>> >> I have attached the error below.
>> >>
>> >> Thanks for any insight,
>> >>
>> >> sb.
>> >> Scott Blomquist
>> >>
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: ------------[ cut here ]------------
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: WARNING: at /tmp/rpmbuild-lustre-jenkins-U6NXEPsD/BUILD/lustre-2.7.15.3/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]()
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) vfat fat usb_storage mpt3sas mptctl mptbase dell_rbu lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic fuse crypto_null libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) vxlan ip6_udp_tunnel udp_tunnel intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt dcdbas cryptd iTCO_vendor_support dm_round_robin pcspkr sg ipmi_devintf sb_edac edac_core acpi_power_meter ntb ipmi_si wmi shpchp acpi_pad ipmi_msghandler lpc_ich mei_me mei mfd_core
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: knem(OE) nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) sr_mod cdrom sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper crct10dif_pclmul crct10dif_common mpt2sas crc32c_intel ttm ahci raid_class drm libahci scsi_transport_sas mlx4_core(OE) mlx_compat(OE) libata tg3 i2c_core ptp megaraid_sas pps_core dm_mirror dm_region_hash dm_log dm_mod
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: CPU: 10 PID: 6577 Comm: mdt01_003 Tainted: G OE ------------ 3.10.0-327.el7_lustre.gd4cb884.x86_64 #1
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: 0000000000000000 00000000fff8486a ffff881f815234f0 ffffffff81635429
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: ffff881f81523528 ffffffff8107b200 ffff880fc708f1a0 ffff880fe535b060
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: ffff881fa4aff7c8 ffffffffa10b9a9c 0000000000000325 ffff881f81523538
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Call Trace:
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff81635429>] dump_stack+0x19/0x1b
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff8107b200>] warn_slowpath_common+0x70/0xb0
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff8107b34a>] warn_slowpath_null+0x1a/0x20
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa1057052>] __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa10a88c1>] ldiskfs_getblk+0x131/0x200 [ldiskfs]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa10a89b7>] ldiskfs_bread+0x27/0xc0 [ldiskfs]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa11ff069>] osd_ldiskfs_write_record+0x169/0x360 [osd_ldiskfs]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa11ff358>] osd_write+0xf8/0x230 [osd_ldiskfs]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0952cd5>] dt_record_write+0x45/0x130 [obdclass]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0c0e6c2>] tgt_last_rcvd_update+0x732/0xef0 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa033c1f8>] ? start_this_handle+0xa8/0x5d0 [jbd2]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0c13542>] tgt_txn_stop_cb+0x1a2/0x4a0 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff811c115e>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0952c23>] dt_txn_hook_stop+0x63/0x80 [obdclass]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa11dc9f2>] osd_trans_stop+0x112/0x3d0 [osd_ldiskfs]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa11daa9a>] ? osd_trans_start+0x1ba/0x670 [osd_ldiskfs]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa1338bf9>] mdt_empty_transno+0x109/0x790 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa133baed>] mdt_mfd_open+0x91d/0xeb0 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa133c5fb>] mdt_finish_open+0x57b/0x9d0 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa133d8a0>] mdt_reint_open+0xe50/0x2e00 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0bb1bc7>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0baf670>] ? lustre_msg_buf_v2+0x1b0/0x1b0 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa096f3ee>] ? lu_ucred+0x1e/0x30 [obdclass]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff812fbd92>] ? strlcpy+0x42/0x60
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa1331ce0>] mdt_reint_rec+0x80/0x210 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa1313699>] mdt_reint_internal+0x5d9/0xb40 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa1313d62>] mdt_intent_reint+0x162/0x420 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0baf687>] ? lustre_msg_buf+0x17/0x60 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa1317655>] mdt_intent_opc+0x215/0x9b0 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0bb3d90>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa131e6f8>] mdt_intent_policy+0x138/0x320 [mdt]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0b641f7>] ldlm_lock_enqueue+0x357/0x9c0 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0b8a772>] ldlm_handle_enqueue0+0x4f2/0x16f0 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0bb3e10>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0c16762>] tgt_enqueue+0x62/0x210 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0c1b7cb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0bbe96b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa07c7d98>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0bbba38>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff810af018>] ? __wake_up_common+0x58/0x90
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0bc2290>] ptlrpc_main+0xc00/0x1f50 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffffa0bc1690>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff81645a98>] ret_from_fork+0x58/0x90
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: ---[ end trace 522ffb7aaa9346b5 ]---
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LDISKFS-fs: ldiskfs_getblk:805: aborting transaction: error 28 in __ldiskfs_handle_dirty_metadata
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LDISKFS-fs error (device dm-1): ldiskfs_getblk:805: inode #85: block 607448: comm mdt01_003: journal_dirty_metadata failed: handle type 0 started at line 1141, credits 4/0, errcode -28
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Aborting journal on device dm-1-8.
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LDISKFS-fs (dm-1): Remounting filesystem read-only
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LustreError: 6577:0:(osd_io.c:1655:osd_ldiskfs_write_record()) dm-1: error reading offset 45056 (block 11): rc = -28
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LustreError: 6577:0:(tgt_lastrcvd.c:1191:tgt_last_rcvd_update()) lustre02-MDT0000: can't update reply_data file: rc = -28
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LustreError: 6577:0:(osd_handler.c:1219:osd_trans_stop()) lustre02-MDT0000-osd: failed in transaction hook: rc = -28
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LDISKFS-fs error (device dm-1) in osd_trans_stop:1225: error 28
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LustreError: 6052:0:(osd_handler.c:993:osd_trans_commit_cb()) transaction @0xffff880fcde5c180 commit error: 2
>> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: LustreError: 6577:0:(osd_handler.c:1228:osd_trans_stop()) lustre02-MDT0000-osd: failed to stop transaction: rc = -28
>> >>
>> >> _______________________________________________
>> >> lustre-discuss mailing list
>> >> [email protected]
>> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
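P.S. For anyone who finds this thread in the archives, here is the sequence
above condensed into shell commands. This is only a sketch of what worked for
us: the device path, mount point and fsname (lustre02) come from our setup,
and the client mount line is illustrative because our MGS NID is not shown in
this thread, so adjust all of those for your own site.

  # 1. Mount the MDT by hand with recovery aborted (device path from our setup)
  mount -v -t lustre -o abort_recov /dev/mapper/mpatha /mnt/lustre02-MDT0000

  # 2. Once it has mounted cleanly, unmount it again
  umount /mnt/lustre02-MDT0000

  # 3. Start the MDT normally (via IEEL in our case, or a plain mount without abort_recov)

  # 4. Re-mount the file system on the clients; replace MGSNID with your MGS NID
  #    (e.g. mgsnode@tcp)
  mount -t lustre MGSNID:/lustre02 /mnt/lustre02

  # Aborting recovery only after the MDT had already been started via IEEL, with
  #   lctl --device 4 abort_recovery
  # did not work for us.

The key point, as Colin said, seems to be that abort_recov has to be on the
first mount of the MDT after the crash; by the time you can reach the target
with lctl it is already too late.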
