I managed to get a stack trace from a different machine; interestingly, it's one
that is not on the newer kernel/lustre versions.  I have no proof in hand, but
here's what I suspect is going on (with a rough test sketch after the list):

1. Podman asks Lustre for a file that includes extended attributes.
2. The Lustre client asks the filesystem for the data.
3. The data is returned to the kernel.
4. The data is scanned by Trellix ePO.
5. The Trellix kernel process silently crashes for some reason (probably
   because it doesn't handle Lustre or xattrs very well, or at all).
6. The Lustre client hangs.
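
If anyone wants to poke at step 1 directly, here's a minimal sketch (a few
lines of python) of the same path the trace is stuck in: a plain stat(), then
a read of the security.capability xattr that get_vfs_caps_from_disk() fetches,
then a full xattr listing.  The path in it is hypothetical, so point it at any
regular file on your Lustre mount.  If the theory holds, this should wedge in
D state with the Trellix modules loaded and come back instantly without them.

import os

# Hypothetical path -- substitute any regular file on the Lustre mount.
PATH = "/lustre/scratch/testfile"

# A plain stat(), the same kind of lookup as the newfstatat call in the trace.
st = os.stat(PATH)
print("stat ok:", st.st_size, "bytes")

# The trace shows get_vfs_caps_from_disk() reading this xattr through
# __vfs_getxattr(); ENODATA just means no file capabilities are set.
try:
    caps = os.getxattr(PATH, "security.capability")
    print("security.capability:", caps.hex())
except OSError as err:
    print("no security.capability xattr:", err)

# The full listing goes through ll_xattr_list() on the Lustre client.
print("xattrs:", os.listxattr(PATH))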

Bear in mind I have no proof other than that we've seen issues with Trellix
before.  I suspect this issue has lurked for a long time, but it's only now
showing itself because Podman makes use of extended attributes and locks
(none of our users knowingly do).
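
To back up the "none of our users knowingly do" claim, here's a rough python
sketch (the starting directory is hypothetical) that walks a tree and prints
any files that actually carry xattrs:

import os

ROOT = "/lustre/projects"  # hypothetical starting point, adjust as needed

for dirpath, _dirs, files in os.walk(ROOT):
    for name in files:
        path = os.path.join(dirpath, name)
        try:
            attrs = os.listxattr(path, follow_symlinks=False)
        except OSError:
            continue  # vanished or unreadable, skip it
        if attrs:
            print(path, attrs)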

I haven't been able to run printk on the original machine; we have a
conference going on, so I can't muck with the machine at all.  We unmounted
Lustre for the time being to get through it, and then we'll circle back.
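
When we do circle back, the plan is roughly this (python, assumes root and a
kernel with sysrq available) to dump the blocked-task stacks straight into
dmesg instead of waiting for the hung task watchdog:

# Equivalent to `echo w > /proc/sysrq-trigger`: logs the stack of every
# task stuck in uninterruptible (D) sleep to the kernel ring buffer.

with open("/proc/sys/kernel/sysrq", "w") as f:
    f.write("1\n")   # allow all sysrq functions, including 'w'

with open("/proc/sysrq-trigger", "w") as f:
    f.write("w\n")   # 'w' = show blocked tasks

print("now check dmesg for the blocked-task dump")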

This could be a red herring too, just FYI...

[Thu Nov 13 13:19:48 2025] INFO: task podman:79754 blocked for more than 122 seconds.
[Thu Nov 13 13:19:48 2025]       Tainted: P        W  OE     ------- ---  5.14.0-503.14.1.el9_5.x86_64 #1
[Thu Nov 13 13:19:48 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 13 13:19:48 2025] task:podman          state:D stack:0 pid:79754 tgid:79754 ppid:59232  flags:0x00000006
[Thu Nov 13 13:19:48 2025] Call Trace:
[Thu Nov 13 13:19:48 2025]  <TASK>
[Thu Nov 13 13:19:48 2025]  __schedule+0x229/0x550
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  schedule+0x2e/0xd0
[Thu Nov 13 13:19:48 2025]  schedule_preempt_disabled+0x11/0x20
[Thu Nov 13 13:19:48 2025]  __mutex_lock.constprop.0+0x433/0x6a0
[Thu Nov 13 13:19:48 2025]  ? ___slab_alloc+0x626/0x7a0
[Thu Nov 13 13:19:48 2025]  ll_xattr_find_get_lock+0x6c/0x490 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_cache_refill+0xb6/0xb80 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_cache_get+0x286/0x4b0 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_list+0x3c5/0x7e0 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_get_common+0x184/0x4a0 [lustre]
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  __vfs_getxattr+0x50/0x70
[Thu Nov 13 13:19:48 2025]  get_vfs_caps_from_disk+0x70/0x210
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? __legitimize_path+0x27/0x60
[Thu Nov 13 13:19:48 2025]  audit_copy_inode+0x99/0xd0
[Thu Nov 13 13:19:48 2025]  filename_lookup+0x17b/0x1d0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? audit_filter_rules.constprop.0+0x2c5/0xd30
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? path_get+0x11/0x30
[Thu Nov 13 13:19:48 2025]  vfs_statx+0x8d/0x170
[Thu Nov 13 13:19:48 2025]  vfs_fstatat+0x54/0x70
[Thu Nov 13 13:19:48 2025]  __do_sys_newfstatat+0x26/0x60
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? auditd_test_task+0x3c/0x50
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? __audit_syscall_entry+0xef/0x140
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_trace_enter.constprop.0+0x126/0x1a0
[Thu Nov 13 13:19:48 2025]  do_syscall_64+0x5c/0xf0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? __count_memcg_events+0x4f/0xb0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? mm_account_fault+0x6c/0x100
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? handle_mm_fault+0x116/0x270
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? do_user_addr_fault+0x1d6/0x6a0
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? exc_page_fault+0x62/0x150
[Thu Nov 13 13:19:48 2025]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[Thu Nov 13 13:19:48 2025] RIP: 0033:0x4137ce
[Thu Nov 13 13:19:48 2025] RSP: 002b:000000c0004e0710 EFLAGS: 00000216 ORIG_RAX: 0000000000000106
[Thu Nov 13 13:19:48 2025] RAX: ffffffffffffffda RBX: ffffffffffffff9c RCX: 00000000004137ce
[Thu Nov 13 13:19:48 2025] RDX: 000000c0001321d8 RSI: 000000c0001b0120 RDI: ffffffffffffff9c
[Thu Nov 13 13:19:48 2025] RBP: 000000c0004e0750 R08: 0000000000000000 R09: 0000000000000000
[Thu Nov 13 13:19:48 2025] R10: 0000000000000100 R11: 0000000000000216 R12: 000000c0001b0120
[Thu Nov 13 13:19:48 2025] R13: 0000000000000155 R14: 000000c000002380 R15: 000000c0001321a0
[Thu Nov 13 13:19:48 2025]  </TASK>

[Thu Nov 13 13:19:48 2025] INFO: task (ostnamed):79810 blocked for more than 122 seconds.
[Thu Nov 13 13:19:48 2025]       Tainted: P        W  OE     ------- ---  5.14.0-503.14.1.el9_5.x86_64 #1
[Thu Nov 13 13:19:48 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 13 13:19:48 2025] task:(ostnamed)      state:D stack:0 pid:79810 tgid:79810 ppid:1      flags:0x00000006
[Thu Nov 13 13:19:48 2025] Call Trace:
[Thu Nov 13 13:19:48 2025]  <TASK>
[Thu Nov 13 13:19:48 2025]  __schedule+0x229/0x550
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  schedule+0x2e/0xd0
[Thu Nov 13 13:19:48 2025]  schedule_preempt_disabled+0x11/0x20
[Thu Nov 13 13:19:48 2025]  __mutex_lock.constprop.0+0x433/0x6a0
[Thu Nov 13 13:19:48 2025]  ? ___slab_alloc+0x626/0x7a0
[Thu Nov 13 13:19:48 2025]  ll_xattr_find_get_lock+0x6c/0x490 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_cache_refill+0xb6/0xb80 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_cache_get+0x286/0x4b0 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_list+0x3c5/0x7e0 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_get_common+0x184/0x4a0 [lustre]
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  __vfs_getxattr+0x50/0x70
[Thu Nov 13 13:19:48 2025]  get_vfs_caps_from_disk+0x70/0x210
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? __legitimize_path+0x27/0x60
[Thu Nov 13 13:19:48 2025]  audit_copy_inode+0x99/0xd0
[Thu Nov 13 13:19:48 2025]  filename_lookup+0x17b/0x1d0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? path_get+0x11/0x30
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? audit_alloc_name+0x138/0x150
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  kern_path+0x2e/0x50
[Thu Nov 13 13:19:48 2025]  mfe_aac_extract_path+0x77/0xe0 [mfe_aac_1007193773]
[Thu Nov 13 13:19:48 2025]  mfe_aac_sys_openat_64_bit+0x114/0x320 [mfe_aac_1007193773]
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? _copy_to_iter+0x17c/0x5f0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? auditd_test_task+0x3c/0x50
[Thu Nov 13 13:19:48 2025]  ? mfe_fileaccess_sys_openat_64_bit+0x2f/0x1f0 [mfe_fileaccess_1007193773]
[Thu Nov 13 13:19:48 2025]  mfe_fileaccess_sys_openat_64_bit+0x2f/0x1f0 [mfe_fileaccess_1007193773]
[Thu Nov 13 13:19:48 2025]  do_syscall_64+0x5c/0xf0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_work+0x103/0x130
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? do_syscall_64+0x6b/0xf0
[Thu Nov 13 13:19:48 2025]  ? audit_reset_context.part.0.constprop.0+0xe5/0x2e0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? free_to_partial_list+0x80/0x280
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? mntput_no_expire+0x4a/0x250
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? audit_reset_context.part.0.constprop.0+0x273/0x2e0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_work+0x103/0x130
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? do_syscall_64+0x6b/0xf0
[Thu Nov 13 13:19:48 2025]  ? sysvec_apic_timer_interrupt+0x3c/0x90
[Thu Nov 13 13:19:48 2025]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[Thu Nov 13 13:19:48 2025] RIP: 0033:0x7fb4f3efdc54
[Thu Nov 13 13:19:48 2025] RSP: 002b:00007ffc61c44c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[Thu Nov 13 13:19:48 2025] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb4f3efdc54
[Thu Nov 13 13:19:48 2025] RDX: 00000000002a0000 RSI: 00005599df0a79a0 RDI: 00000000ffffff9c
[Thu Nov 13 13:19:48 2025] RBP: 00005599df0a79a0 R08: 0000000000000000 R09: 0000000000000000
[Thu Nov 13 13:19:48 2025] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000002a0000
[Thu Nov 13 13:19:48 2025] R13: 0000000000000000 R14: 0000000000001c27 R15: 00005599defce360
[Thu Nov 13 13:19:48 2025]  </TASK>


On Fri, Oct 31, 2025 at 2:42 PM John Hearns <[email protected]> wrote:
>
> For information, arpwatch can be used to alert on duplicated addresses.
>
> https://en.wikipedia.org/wiki/Arpwatch
>
> On Fri, 31 Oct 2025 at 13:13, Michael DiDomenico via lustre-discuss 
> <[email protected]> wrote:
>>
>> unfortunately i don't think so.  we're pretty good about assigning
>> addresses, but still human.  i don't see any evidence of a dup'd
>> address, but i'll keep looking
>>
>> thanks
>>
>> On Thu, Oct 30, 2025 at 8:10 PM Mohr, Rick <[email protected]> wrote:
>> >
>> > Michael,
>> >
>> > It might be a long shot, but is there any chance another machine has the 
>> > same IP address as the one having problems?
>> >
>> > --Rick
>> >
>> >
>> >
>> > On 10/30/25, 3:09 PM, "lustre-discuss on behalf of Michael DiDomenico via 
>> > lustre-discuss" wrote:
>> > our network is running 2.15.6 everywhere on rhel9.5. we recently built a
>> > new machine using 2.15.7 on rhel9.6 and i'm seeing a strange problem. the
>> > client is ethernet connected to ten lnet routers which bridge ethernet to
>> > infiniband. i can mount the client just fine and read/write data, but
>> > several hours later the client marks all the routers offline. the only
>> > recovery is to lazy unmount, lustre_rmmod, and then restart the lustre
>> > mount.
>> >
>> > nothing unusual comes out in the journal/dmesg logs. to lustre it "looks"
>> > like someone pulled the network cable, but there's no evidence that this
>> > has happened physically or even at the switch/software layers.
>> >
>> > we upgraded two other machines to see if the problem replicates, but so
>> > far it hasn't. the only significant difference between the three machines
>> > is that the one with the problem has heavy container (podman) usage; the
>> > others have zero. i'm not sure if this is a cause or just a red herring.
>> > any suggestions?
>> >
>> >
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
