I managed to get a stack trace from a different machine; interestingly, it's not one of the machines on the newer kernel/lustre versions. With no hard proof in hand, here's what I suspect is going on:
1. Podman asks for a file from Lustre that includes extended attributes.
2. The Lustre client asks the filesystem for the data.
3. The data is returned to the kernel.
4. The data is scanned by Trellix ePO.
5. The Trellix kernel process silently crashes for some reason (probably because it doesn't handle Lustre or xattrs very well, or at all).
6. The Lustre client hangs.

Bear in mind I have no proof other than that we've seen issues with Trellix before. I suspect this issue has lurked for a long time but is only now showing itself because Podman makes use of extended attributes and locks (none of our users knowingly do). I haven't been able to run printk on the original machine; we have a conference going on, so I can't muck with that machine at all. We unmounted Lustre for the time being to get us through, and we'll circle back afterwards. This could be a red herring too, just FYI. (A quick userspace probe sketch is appended at the bottom of this mail, below the quoted thread.)

[Thu Nov 13 13:19:48 2025] INFO: task podman:79754 blocked for more than 122 seconds.
[Thu Nov 13 13:19:48 2025] Tainted: P W OE ------- --- 5.14.0-503.14.1.el9_5.x86_64 #1
[Thu Nov 13 13:19:48 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 13 13:19:48 2025] task:podman state:D stack:0 pid:79754 tgid:79754 ppid:59232 flags:0x00000006
[Thu Nov 13 13:19:48 2025] Call Trace:
[Thu Nov 13 13:19:48 2025] <TASK>
[Thu Nov 13 13:19:48 2025] __schedule+0x229/0x550
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] schedule+0x2e/0xd0
[Thu Nov 13 13:19:48 2025] schedule_preempt_disabled+0x11/0x20
[Thu Nov 13 13:19:48 2025] __mutex_lock.constprop.0+0x433/0x6a0
[Thu Nov 13 13:19:48 2025] ? ___slab_alloc+0x626/0x7a0
[Thu Nov 13 13:19:48 2025] ll_xattr_find_get_lock+0x6c/0x490 [lustre]
[Thu Nov 13 13:19:48 2025] ll_xattr_cache_refill+0xb6/0xb80 [lustre]
[Thu Nov 13 13:19:48 2025] ll_xattr_cache_get+0x286/0x4b0 [lustre]
[Thu Nov 13 13:19:48 2025] ll_xattr_list+0x3c5/0x7e0 [lustre]
[Thu Nov 13 13:19:48 2025] ll_xattr_get_common+0x184/0x4a0 [lustre]
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] __vfs_getxattr+0x50/0x70
[Thu Nov 13 13:19:48 2025] get_vfs_caps_from_disk+0x70/0x210
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? __legitimize_path+0x27/0x60
[Thu Nov 13 13:19:48 2025] audit_copy_inode+0x99/0xd0
[Thu Nov 13 13:19:48 2025] filename_lookup+0x17b/0x1d0
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? audit_filter_rules.constprop.0+0x2c5/0xd30
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? path_get+0x11/0x30
[Thu Nov 13 13:19:48 2025] vfs_statx+0x8d/0x170
[Thu Nov 13 13:19:48 2025] vfs_fstatat+0x54/0x70
[Thu Nov 13 13:19:48 2025] __do_sys_newfstatat+0x26/0x60
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? auditd_test_task+0x3c/0x50
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? __audit_syscall_entry+0xef/0x140
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? syscall_trace_enter.constprop.0+0x126/0x1a0
[Thu Nov 13 13:19:48 2025] do_syscall_64+0x5c/0xf0
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? __count_memcg_events+0x4f/0xb0
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? mm_account_fault+0x6c/0x100
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? handle_mm_fault+0x116/0x270
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? do_user_addr_fault+0x1d6/0x6a0
[Thu Nov 13 13:19:48 2025] ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? exc_page_fault+0x62/0x150
[Thu Nov 13 13:19:48 2025] entry_SYSCALL_64_after_hwframe+0x78/0x80
[Thu Nov 13 13:19:48 2025] RIP: 0033:0x4137ce
[Thu Nov 13 13:19:48 2025] RSP: 002b:000000c0004e0710 EFLAGS: 00000216 ORIG_RAX: 0000000000000106
[Thu Nov 13 13:19:48 2025] RAX: ffffffffffffffda RBX: ffffffffffffff9c RCX: 00000000004137ce
[Thu Nov 13 13:19:48 2025] RDX: 000000c0001321d8 RSI: 000000c0001b0120 RDI: ffffffffffffff9c
[Thu Nov 13 13:19:48 2025] RBP: 000000c0004e0750 R08: 0000000000000000 R09: 0000000000000000
[Thu Nov 13 13:19:48 2025] R10: 0000000000000100 R11: 0000000000000216 R12: 000000c0001b0120
[Thu Nov 13 13:19:48 2025] R13: 0000000000000155 R14: 000000c000002380 R15: 000000c0001321a0
[Thu Nov 13 13:19:48 2025] </TASK>
[Thu Nov 13 13:19:48 2025] INFO: task (ostnamed):79810 blocked for more than 122 seconds.
[Thu Nov 13 13:19:48 2025] Tainted: P W OE ------- --- 5.14.0-503.14.1.el9_5.x86_64 #1
[Thu Nov 13 13:19:48 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 13 13:19:48 2025] task:(ostnamed) state:D stack:0 pid:79810 tgid:79810 ppid:1 flags:0x00000006
[Thu Nov 13 13:19:48 2025] Call Trace:
[Thu Nov 13 13:19:48 2025] <TASK>
[Thu Nov 13 13:19:48 2025] __schedule+0x229/0x550
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] schedule+0x2e/0xd0
[Thu Nov 13 13:19:48 2025] schedule_preempt_disabled+0x11/0x20
[Thu Nov 13 13:19:48 2025] __mutex_lock.constprop.0+0x433/0x6a0
[Thu Nov 13 13:19:48 2025] ? ___slab_alloc+0x626/0x7a0
[Thu Nov 13 13:19:48 2025] ll_xattr_find_get_lock+0x6c/0x490 [lustre]
[Thu Nov 13 13:19:48 2025] ll_xattr_cache_refill+0xb6/0xb80 [lustre]
[Thu Nov 13 13:19:48 2025] ll_xattr_cache_get+0x286/0x4b0 [lustre]
[Thu Nov 13 13:19:48 2025] ll_xattr_list+0x3c5/0x7e0 [lustre]
[Thu Nov 13 13:19:48 2025] ll_xattr_get_common+0x184/0x4a0 [lustre]
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] __vfs_getxattr+0x50/0x70
[Thu Nov 13 13:19:48 2025] get_vfs_caps_from_disk+0x70/0x210
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? __legitimize_path+0x27/0x60
[Thu Nov 13 13:19:48 2025] audit_copy_inode+0x99/0xd0
[Thu Nov 13 13:19:48 2025] filename_lookup+0x17b/0x1d0
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? path_get+0x11/0x30
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? audit_alloc_name+0x138/0x150
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] kern_path+0x2e/0x50
[Thu Nov 13 13:19:48 2025] mfe_aac_extract_path+0x77/0xe0 [mfe_aac_1007193773]
[Thu Nov 13 13:19:48 2025] mfe_aac_sys_openat_64_bit+0x114/0x320 [mfe_aac_1007193773]
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? _copy_to_iter+0x17c/0x5f0
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? auditd_test_task+0x3c/0x50
[Thu Nov 13 13:19:48 2025] ? mfe_fileaccess_sys_openat_64_bit+0x2f/0x1f0 [mfe_fileaccess_1007193773]
[Thu Nov 13 13:19:48 2025] mfe_fileaccess_sys_openat_64_bit+0x2f/0x1f0 [mfe_fileaccess_1007193773]
[Thu Nov 13 13:19:48 2025] do_syscall_64+0x5c/0xf0
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? syscall_exit_work+0x103/0x130
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? do_syscall_64+0x6b/0xf0
[Thu Nov 13 13:19:48 2025] ? audit_reset_context.part.0.constprop.0+0xe5/0x2e0
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? free_to_partial_list+0x80/0x280
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? mntput_no_expire+0x4a/0x250
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? audit_reset_context.part.0.constprop.0+0x273/0x2e0
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? syscall_exit_work+0x103/0x130
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025] ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025] ? do_syscall_64+0x6b/0xf0
[Thu Nov 13 13:19:48 2025] ? sysvec_apic_timer_interrupt+0x3c/0x90
[Thu Nov 13 13:19:48 2025] entry_SYSCALL_64_after_hwframe+0x78/0x80
[Thu Nov 13 13:19:48 2025] RIP: 0033:0x7fb4f3efdc54
[Thu Nov 13 13:19:48 2025] RSP: 002b:00007ffc61c44c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[Thu Nov 13 13:19:48 2025] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb4f3efdc54
[Thu Nov 13 13:19:48 2025] RDX: 00000000002a0000 RSI: 00005599df0a79a0 RDI: 00000000ffffff9c
[Thu Nov 13 13:19:48 2025] RBP: 00005599df0a79a0 R08: 0000000000000000 R09: 0000000000000000
[Thu Nov 13 13:19:48 2025] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000002a0000
[Thu Nov 13 13:19:48 2025] R13: 0000000000000000 R14: 0000000000001c27 R15: 00005599defce360
[Thu Nov 13 13:19:48 2025] </TASK>

On Fri, Oct 31, 2025 at 2:42 PM John Hearns <[email protected]> wrote:
>
> For information, arpwatch can be used to alert on duplicated addresses
>
> https://en.wikipedia.org/wiki/Arpwatch
>
> On Fri, 31 Oct 2025 at 13:13, Michael DiDomenico via lustre-discuss
> <[email protected]> wrote:
>>
>> unfortunately i don't think so. we're pretty good about assigning
>> addresses, but still human. i don't see any evidence of a dup'd
>> address, but i'll keep looking
>>
>> thanks
>>
>> On Thu, Oct 30, 2025 at 8:10 PM Mohr, Rick <[email protected]> wrote:
>> >
>> > Michael,
>> >
>> > It might be a long shot, but is there any chance another machine has the
>> > same IP address as the one having problems?
>> >
>> > --Rick
>> >
>> >
>> > On 10/30/25, 3:09 PM, "lustre-discuss on behalf of Michael DiDomenico
>> > via lustre-discuss" wrote:
>> > our network is running 2.15.6 everywhere on rhel9.5, we recently built a
>> > new machine using 2.15.7 on rhel9.6 and i'm seeing a strange problem. the
>> > client is ethernet connected to ten lnet routers which bridge ethernet to
>> > infiniband. i can mount the client just fine, read/write data, but then
>> > several hours later, the client marks all the routers offline.
>> > the only recovery is to lazy unmount, lustre_rmmod, and then restart the
>> > lustre mount. nothing unusual comes out in the journal/dmesg logs. to
>> > lustre it "looks" like someone pulled the network cable, but there's no
>> > evidence that this has happened physically or even at the switch/software
>> > layers. we upgraded two other machines to see if the problem replicates,
>> > but so far it hasn't. the only significant difference between the three
>> > machines is the one with the problem has heavy container (podman) usage,
>> > the others have zero. i'm not sure if this is a cause or just a red
>> > herring. any suggestions?
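
P.S. If anyone wants a quick way to poke at the xattr path outside of Podman, below is a minimal probe I sketched but have not actually run yet, so treat it as an assumption rather than a tested tool. It forks a child that fetches the security.capability xattr on whatever Lustre file you pass it (the same attribute get_vfs_caps_from_disk is reading in the traces above), and the parent reports whether the child came back within 30 seconds or appears wedged; if it's wedged, its kernel stack should show the same ll_xattr_* path. The filename, timeout, and buffer size are all arbitrary.

/* xattr_probe.c - build with: cc -o xattr_probe xattr_probe.c */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/xattr.h>

#define TIMEOUT_SECS 30   /* arbitrary; long enough to rule out normal latency */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-lustre>\n", argv[0]);
        return 2;
    }

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 2;
    }

    if (pid == 0) {
        /* child: fetch the same security.capability xattr that
         * get_vfs_caps_from_disk asks for in the traces above */
        char buf[256];
        ssize_t n = getxattr(argv[1], "security.capability", buf, sizeof(buf));
        int saved = errno;

        if (n >= 0)
            printf("got %zd bytes of security.capability\n", n);
        else if (saved == ENODATA)
            printf("no security.capability xattr on this file (that's fine)\n");
        else
            fprintf(stderr, "getxattr: %s\n", strerror(saved));
        _exit(n < 0 && saved != ENODATA ? 1 : 0);
    }

    /* parent: poll instead of blocking in wait(), so we can still report
     * a hang even if the child is stuck in uninterruptible (D) sleep */
    for (int i = 0; i < TIMEOUT_SECS; i++) {
        int status;
        if (waitpid(pid, &status, WNOHANG) == pid) {
            printf("child finished, exit status %d\n",
                   WIFEXITED(status) ? WEXITSTATUS(status) : -1);
            return 0;
        }
        sleep(1);
    }
    printf("child (pid %ld) still blocked after %d seconds; check /proc/%ld/stack\n",
           (long)pid, TIMEOUT_SECS, (long)pid);
    return 1;
}

Run it against a file inside one of the Podman image/container directories on the Lustre mount versus a plain user file; if only the container files wedge, that would line up with the xattr theory.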
