Have you tried resilvering the pool?

On Wed, Mar 15, 2023, 11:57 AM Mountford, Christopher J. (Dr.) via lustre-discuss <[email protected]> wrote:
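For what it's worth, the two constants in the VERIFY3 line below check out against the ZFS system-attribute magic, which suggests real on-disk SA corruption rather than a mangled log. A quick decode (plain Python, nothing here touches the pool; the SA_MAGIC name is the constant from the ZFS source):

```python
# The panic line reads:
#   VERIFY3(sa.sa_magic == 0x2F505A) failed (8 == 3100762)
# 0x2F505A is SA_MAGIC, the on-disk magic for a ZFS system-attribute
# (SA) header; 3100762 is simply its decimal form, and 8 is the bogus
# value actually read back from the dnode bonus buffer.
SA_MAGIC = 0x2F505A

print(SA_MAGIC)         # 3100762 -- matches the "expected" side of the message
print(hex(3100762))     # 0x2f505a -- and back again
# So the assertion is comparing the magic it read (8) against the real
# SA_MAGIC: the SA header itself is corrupt, not just a version mismatch.
```

That would point at a damaged dnode bonus buffer on the MDT, which a scrub may at least surface as a checksum error.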
> I'm hoping someone can offer some suggestions.
>
> We have a problem on our production Lustre/ZFS filesystem (CentOS 7, ZFS 0.7.13, Lustre 2.12.9); so far I've drawn a blank trying to track down the cause of this.
>
> We see the following ZFS panic message in the logs (in every case the VERIFY3/panic lines are identical):
>
> Mar 15 17:15:39 amds01a kernel: VERIFY3(sa.sa_magic == 0x2F505A) failed (8 == 3100762)
> Mar 15 17:15:39 amds01a kernel: PANIC at zfs_vfsops.c:584:zfs_space_delta_cb()
> Mar 15 17:15:39 amds01a kernel: Showing stack for process 15381
> Mar 15 17:15:39 amds01a kernel: CPU: 31 PID: 15381 Comm: mdt00_020 Tainted: P OE ------------ 3.10.0-1160.49.1.el7_lustre.x86_64 #1
> Mar 15 17:15:39 amds01a kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 02/09/2023
> Mar 15 17:15:39 amds01a kernel: Call Trace:
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99d83539>] dump_stack+0x19/0x1b
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b76f24>] spl_dumpstack+0x44/0x50 [spl]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b76ff9>] spl_panic+0xc9/0x110 [spl]
> Mar 15 17:15:39 amds01a kernel: [<ffffffff996e482c>] ? update_curr+0x14c/0x1e0
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99707cf4>] ? getrawmonotonic64+0x34/0xc0
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c87aa3>] ? dmu_zfetch+0x393/0x520 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c6a073>] ? dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b78ff1>] ? __cv_init+0x41/0x60 [spl]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0d0f53c>] zfs_space_delta_cb+0x9c/0x200 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c7a944>] dmu_objset_userquota_get_ids+0x154/0x440 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c89e98>] dnode_setdirty+0x38/0xf0 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c8a21c>] dnode_allocate+0x18c/0x230 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c76d2b>] dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d73052>] __osd_object_create+0x82/0x170 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d7ce23>] ? osd_declare_xattr_set+0xb3/0x190 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d733bd>] osd_mkreg+0x7d/0x210 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99828f01>] ? __kmalloc_node+0x1d1/0x2b0
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d6f8f6>] osd_create+0x336/0xb10 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc2016fb5>] lod_sub_create+0x1f5/0x480 [lod]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc2007729>] lod_create+0x69/0x340 [lod]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d65690>] ? osd_trans_create+0x410/0x410 [osd_zfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc2081993>] mdd_create_object_internal+0xc3/0x300 [mdd]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc206aa4b>] mdd_create_object+0x7b/0x820 [mdd]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc2074fd8>] mdd_create+0xdd8/0x14a0 [mdd]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1f0e118>] mdt_reint_open+0x2588/0x3970 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc16f82b9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1eede52>] ? ucred_set_audit_enabled.isra.15+0x22/0x60 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1f00f23>] mdt_reint_rec+0x83/0x210 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1edc413>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee8ec6>] ? mdt_intent_fixup_resent+0x36/0x220 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee9132>] mdt_intent_open+0x82/0x3a0 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1edf74a>] mdt_intent_opc+0x1ba/0xb50 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a0d6c0>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee90b0>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee79e4>] mdt_intent_policy+0x1a4/0x360 [mdt]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc19bc4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc10a22b7>] ? cfs_hash_bd_add_locked+0x67/0x90 [libcfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc10a5a4e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc19e3aa6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a0d740>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a6d092>] tgt_enqueue+0x62/0x210 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a73eea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a4d601>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1096bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a18bcb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a156e5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99d7dcf3>] ? queued_spin_lock_slowpath+0xb/0xf
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99d8baa0>] ? _raw_spin_lock+0x20/0x30
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a1c534>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a1ba00>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
> Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5e61>] kthread+0xd1/0xe0
> Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5d90>] ? insert_kthread_work+0x40/0x40
> Mar 15 17:15:39 amds01a kernel: [<ffffffff99d95ddd>] ret_from_fork_nospec_begin+0x7/0x21
> Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5d90>] ? insert_kthread_work+0x40/0x40
>
> At this point all ZFS I/O freezes completely and the MDS has to be fenced. This has happened ~4 times in the last hour.
>
> I'm at a loss how to correct this. I'm currently thinking that we may have to rebuild and recover our entire filesystem from backups (thankfully this is our home filesystem, which is small and entirely SSD based, so it should not take too long to recover).
>
> This may be related to a bug seen on FreeBSD (with a much more recent ZFS version): https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216586
>
> The problem was first seen 3 weeks ago, but went away after a couple of reboots. This time it seems to be more serious.
>
> Kind Regards,
> Christopher.
>
> ------------------------------------
> Dr. Christopher Mountford,
> System Specialist,
> RCS,
> Digital Services,
> University Of Leicester.
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
