Hi Aurelien,

Thanks. I guess we will have to rebuild our own 2.15.x server. I see that other 
crashes have a different dump, usually like this:

[36664.403408] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000000
[36664.411237] PGD 0 P4D 0
[36664.413776] Oops: 0000 [#1] SMP PTI
[36664.417268] CPU: 28 PID: 11101 Comm: qmt_reba_cedar_ Kdump: loaded Tainted: 
G          IOE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[36664.430293] Hardware name: Dell Inc. PowerEdge R640/0CRT1G, BIOS 2.19.1 
06/04/2023
[36664.437860] RIP: 0010:qmt_id_lock_cb+0x69/0x100 [lquota]
[36664.443199] Code: 48 8b 53 20 8b 4a 0c 85 c9 74 78 89 c1 48 8b 42 18 83 78 
10 02 75 0a 83 e1 01 b8 01 00 00 00 74 17 48 63 44 24 04 48 c1 e0 04 <48> 03 45 
00 f6 40 08 0c 0f 95 c0 0f b6 c0 48 8b 4c 24 08 65 48 33
[36664.461942] RSP: 0018:ffffaa2e303f3df0 EFLAGS: 00010246
[36664.467169] RAX: 0000000000000000 RBX: ffff98722c74b700 RCX: 0000000000000000
[36664.474301] RDX: ffff9880415ce660 RSI: 0000000000000010 RDI: ffff9881240b5c64
[36664.481435] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004
[36664.488566] R10: 0000000000000010 R11: f000000000000000 R12: ffff98722c74b700
[36664.495697] R13: ffff9875fc07a320 R14: ffff9878444d3d10 R15: ffff9878444d3cc0
[36664.502832] FS:  0000000000000000(0000) GS:ffff987f20f80000(0000) 
knlGS:0000000000000000
[36664.510917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[36664.516664] CR2: 0000000000000000 CR3: 0000002065a10004 CR4: 00000000007706e0
[36664.523794] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[36664.530927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[36664.538058] PKRU: 55555554
[36664.540772] Call Trace:
[36664.543231]  ? cfs_cdebug_show.part.3.constprop.23+0x20/0x20 [lquota]
[36664.549699]  qmt_glimpse_lock.isra.20+0x1e7/0xfa0 [lquota]
[36664.555204]  qmt_reba_thread+0x5cd/0x9b0 [lquota]
[36664.559927]  ? qmt_glimpse_lock.isra.20+0xfa0/0xfa0 [lquota]
[36664.565602]  kthread+0x134/0x150
[36664.568834]  ? set_kthread_struct+0x50/0x50
[36664.573021]  ret_from_fork+0x1f/0x40
[36664.576603] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) 
mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) mbcache jbd2 lustre(OE) 
lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ko2iblnd(OE) 
ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc dell_rbu 
vfat fat dm_round_robin dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert 
iscsi_target_mod target_core_mod ib_iser libiscsi opa_vnic scsi_transport_iscsi 
ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr intel_rapl_common 
isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp 
coretemp kvm_intel dell_smbios iTCO_wdt iTCO_vendor_support wmi_bmof 
dell_wmi_descriptor dcdbas kvm ipmi_ssif irqbypass crct10dif_pclmul hfi1 
mgag200 crc32_pclmul drm_shmem_helper ghash_clmulni_intel rdmavt qla2xxx 
drm_kms_helper rapl ib_uverbs nvme_fc intel_cstate syscopyarea nvme_fabrics 
sysfillrect sysimgblt nvme_core intel_uncore fb_sys_fops pcspkr acpi_ipmi 
ib_core scsi_transport_fc igb
[36664.576699]  drm ipmi_si i2c_algo_bit mei_me dca ipmi_devintf mei i2c_i801 
lpc_ich wmi ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg 
ahci libahci crc32c_intel libata megaraid_sas dm_mirror dm_region_hash dm_log 
dm_mod
[36664.684758] CR2: 0000000000000000

Is this also related to the same bug?

Thanks,

Lixin.

From: Aurelien Degremont <adegrem...@nvidia.com>
Date: Wednesday, November 29, 2023 at 8:31 AM
To: lustre-discuss <lustre-discuss@lists.lustre.org>, Lixin Liu <l...@sfu.ca>
Subject: RE: MDS crashes, lustre version 2.15.3

You are likely hitting this bug: https://jira.whamcloud.com/browse/LU-15207. 
It is fixed in the (not yet released) 2.16.0.

Aurélien
________________________________
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Lixin Liu via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Wednesday, November 29, 2023 5:18 PM
To: lustre-discuss <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] MDS crashes, lustre version 2.15.3

Hi,

We built our 2.15.3 environment a few months ago. The MDT is using ldiskfs and 
the OSTs are using ZFS. The system seemed to perform well at the beginning, 
but recently we have seen frequent MDS crashes.
The vmcore-dmesg.txt shows the following:

[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) 
ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 
4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023
[26056.063099] Call Trace TBD:
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.096008] [<0>] kthread+0x134/0x150
[26056.100098] [<0>] ret_from_fork+0x35/0x40
[26056.104575] Kernel panic - not syncing: LBUG
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: 
G           OE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[26056.123892] Hardware name:  /086D43, BIOS 2.17.0 03/15/2023
[26056.130108] Call Trace:
[26056.132833]  dump_stack+0x41/0x60
[26056.136532]  panic+0xe7/0x2ac
[26056.139843]  ? ret_from_fork+0x35/0x40
[26056.144022]  ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]
[26056.149762]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[26056.155306]  cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.161335]  ? wait_for_completion+0xb8/0x100
[26056.166196]  qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.172128]  qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.177381]  ? __schedule+0x2d9/0x870
[26056.181466]  ? qsd_bump_version+0x3b0/0x3b0 [lquota]
[26056.187010]  kthread+0x134/0x150
[26056.190608]  ? set_kthread_struct+0x50/0x50
[26056.195272]  ret_from_fork+0x35/0x40

We have also experienced unexpected OST drops (changes to inactive mode) on 
login nodes, and the only way to bring them back is to reboot the client.

Any suggestions?

Thanks,

Lixin Liu
Simon Fraser University

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
