Hi Eric,
We (and others I've talked to) have the same problem.
The solution is to downgrade the kernel to the Rocky 9.4 one, and
downgrade the Lustre RPMs to 2.15.5.
We did find an issue filed against it, and a patch, so I guess we need to
wait for 2.15.7 (it seems the fix did not make it into 2.15.6):
https://jira.whamcloud.com/projects/LU/issues/LU-18085
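For reference, the downgrade looks roughly like this. This is only a
sketch: the exact kernel NVR and Lustre package names depend on your
mirror and repo layout, so check with `dnf list --showduplicates` first.

```shell
# Install a Rocky 9.4 kernel alongside the 9.5 one (the 427.x.el9_4
# version below is an example NVR; pick whichever your mirror carries).
sudo dnf list --showduplicates kernel
sudo dnf install kernel-5.14.0-427.42.1.el9_4.x86_64

# Make the older kernel the default boot entry.
sudo grubby --set-default /boot/vmlinuz-5.14.0-427.42.1.el9_4.x86_64

# Downgrade the Lustre client RPMs to 2.15.5 (package names here assume
# the client RPM naming used by the Whamcloud builds).
sudo dnf downgrade 'lustre-client-2.15.5*' 'kmod-lustre-client-2.15.5*'

# Reboot into the 9.4 kernel and confirm before remounting.
sudo reboot
```

After the reboot, `uname -r` should show the el9_4 kernel and `rpm -q lustre-client` should show 2.15.5 before you remount the filesystems.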
Cheers,
Alastair.
On Wed, 12 Feb 2025, Walter, Eric wrote:
Hello,
We have recently upgraded a cluster to Rocky 9.5 (kernel version
5.14.0-503.22.1.el9_5.x86_64). After upgrading to the lustre-2.15.6 client,
we are seeing repeated kernel oopses / crashes when jobs are reading from or
writing to both of our Lustre filesystems, after about 3-4 hours of running.
It is repeatable and results in a kernel oops referencing Lustre's ldlm thread.
Only our clients on Rocky 9.5 are affected; no other systems are having
issues.
We would normally mount with o2ib (we upgraded to Mellanox driver version
24.10-1.1.4.0 for Rocky 9.5); however, our tests still result in the same ldlm
kernel oops when mounted over tcp.
The oops-related output from vmcore-dmesg.txt is posted below.
I have looked for various known issues with 2.15.6 and can't find anyone else
reporting this. Any ideas on what to do besides downgrade to Rocky 9.4? Has
anyone else seen such a problem with 9.5 and clients using v2.15.6?
[ 6267.182434] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 6267.182441] #PF: supervisor write access in kernel mode
[ 6267.182443] #PF: error_code(0x0002) - not-present page
[ 6267.182444] PGD 1924d7067 P4D 134554067 PUD 10ac05067 PMD 0
[ 6267.182449] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 6267.182451] CPU: 15 PID: 3599 Comm: ldlm_bl_04 Kdump: loaded Tainted: G
OE ------- --- 5.14.0-503.22.1.el9_5.x86_64 #1
[ 6267.182454] Hardware name: Dell Inc. PowerEdge R6625/0NWPW3, BIOS 1.5.8
07/21/2023
[ 6267.182455] RIP: 0010:ll_prune_negative_children+0x9d/0x250 [lustre]
[ 6267.182483] Code: 00 00 48 85 ed 74 46 48 81 ed 98 00 00 00 74 3d 48 83 7d 30 00
75 e4 4c 8d 7d 60 4c 89 ff e8 da 20 fb cf 48 8b 85 80 00 00 00 <80> 48 04 01 8b
45 64 85 c0 0f 84 ae 00 00 00 4c 89 ff e8 ac 21 fb
[ 6267.182485] RSP: 0018:ff75eed96a0c7c90 EFLAGS: 00010246
[ 6267.182487] RAX: 0000000000000000 RBX: ff28db3ed37d92c0 RCX: 0000000000000000
[ 6267.182488] RDX: 0000000000000001 RSI: ff28db0fdb1e00b0 RDI: ff28db0fc22c9860
[ 6267.182489] RBP: ff28db0fc22c9800 R08: 0000000000000000 R09: ffffffa1dd3f0088
[ 6267.182489] R10: ff28db3ec76f5c00 R11: 000000000005eee0 R12: ff28db3ed37d9320
[ 6267.182490] R13: ff28db3ece52d528 R14: ff28db3ece52d4a0 R15: ff28db0fc22c9860
[ 6267.182491] FS: 0000000000000000(0000) GS:ff28db3dfebc0000(0000)
knlGS:0000000000000000
[ 6267.182493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6267.182494] CR2: 0000000000000004 CR3: 0000000138eec006 CR4: 0000000000771ef0
[ 6267.182495] PKRU: 55555554
[ 6267.182495] Call Trace:
[ 6267.182499] <TASK>
[ 6267.182500] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6267.182506] ? show_trace_log_lvl+0x26e/0x2df
[ 6267.182513] ? show_trace_log_lvl+0x26e/0x2df
[ 6267.182517] ? ll_lock_cancel_bits+0x73a/0x760 [lustre]
[ 6267.182535] ? __die_body.cold+0x8/0xd
[ 6267.182538] ? page_fault_oops+0x134/0x170
[ 6267.182542] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6267.182545] ? exc_page_fault+0x62/0x150
[ 6267.182549] ? asm_exc_page_fault+0x22/0x30
[ 6267.182553] ? ll_prune_negative_children+0x9d/0x250 [lustre]
[ 6267.182570] ll_lock_cancel_bits+0x73a/0x760 [lustre]
[ 6267.182588] ll_md_blocking_ast+0x1a3/0x300 [lustre]
[ 6267.182606] ldlm_cancel_callback+0x7a/0x290 [ptlrpc]
[ 6267.182639] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6267.182642] ldlm_cli_cancel_local+0xce/0x440 [ptlrpc]
[ 6267.182674] ldlm_cli_cancel+0x271/0x520 [ptlrpc]
[ 6267.182705] ll_md_blocking_ast+0x1cd/0x300 [lustre]
[ 6267.182722] ldlm_handle_bl_callback+0x105/0x3e0 [ptlrpc]
[ 6267.182753] ldlm_bl_thread_blwi.constprop.0+0xa7/0x340 [ptlrpc]
[ 6267.182782] ldlm_bl_thread_main+0x533/0x610 [ptlrpc]
[ 6267.182811] ? __pfx_autoremove_wake_function+0x10/0x10
[ 6267.182817] ? __pfx_ldlm_bl_thread_main+0x10/0x10 [ptlrpc]
[ 6267.182846] kthread+0xdd/0x100
[ 6267.182851] ? __pfx_kthread+0x10/0x10
[ 6267.182853] ret_from_fork+0x29/0x50
[ 6267.182859] </TASK>
[ 6267.182860] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE)
lov(OE) fld(OE) osc(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE) lnet(OE)
rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver libcfs(OE) nfs lockd grace
fscache netfs rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
ib_umad(OE) sunrpc binfmt_misc vfat fat amd_atl intel_rapl_msr ipmi_ssif
intel_rapl_common amd64_edac dell_wmi edac_mce_amd ledtrig_audio sparse_keymap
rfkill kvm_amd mgag200 acpi_ipmi i2c_algo_bit video drm_shmem_helper kvm
ipmi_si dell_smbios ipmi_devintf dcdbas drm_kms_helper dell_wmi_descriptor rapl
wmi_bmof pcspkr i2c_piix4 ipmi_msghandler k10temp acpi_power_meter fuse drm xfs
libcrc32c mlx5_ib(OE) macsec ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE)
sd_mod t10_pi psample ahci mlxdevm(OE) sg libahci mlx_compat(OE)
crct10dif_pclmul crc32_pclmul crc32c_intel tls libata ghash_clmulni_intel tg3
ccp megaraid_sas pci_hyperv_intf sp5100_tco wmi dm_mirror dm_region_hash dm_log
dm_mod xpmem(OE)
[ 6267.182922] CR2: 0000000000000004
Thanks for any help you can provide.
Eric
--
Eric J. Walter
Executive Director, Research Computing
Information Technology
William & Mary
Office: 757-221-1886
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org