On Wed, Dec 10, 2025 at 08:25:59AM -0600, Narayana Murty N wrote:
> The recent commit 1010b4c012b0 ("powerpc/eeh: Make EEH driver device
> hotplug safe") restructured the EEH driver to improve synchronization
> with the PCI hotplug layer.
>
> However, it inadvertently moved pci_lock_rescan_remove() outside its
> intended scope in eeh_handle_normal_event(), leading to broken PCI
> error reporting and improper EEH event triggering. Specifically,
> eeh_handle_normal_event() acquired pci_lock_rescan_remove() before
> calling eeh_pe_bus_get(), but eeh_pe_bus_get() itself attempts to
> acquire the same lock internally, causing nested locking and disrupting
> normal EEH event handling paths.
>
> This patch adds a boolean parameter do_lock to _eeh_pe_bus_get(),
> with two public wrappers:
> eeh_pe_bus_get() with locking enabled.
> eeh_pe_bus_get_nolock() that skips locking.
>
> Callers that already hold pci_lock_rescan_remove() now use
> eeh_pe_bus_get_nolock() to avoid recursive lock acquisition.
>
> Additionally, pci_lock_rescan_remove() calls are restored to the correct
> position—after eeh_pe_bus_get() and immediately before iterating affected
> PEs and devices. This ensures EEH-triggered PCI removes occur under proper
> bus rescan locking without recursive lock contention.
>
> The eeh_pe_loc_get() function has been split into two functions:
> eeh_pe_loc_get(struct eeh_pe *pe) which retrieves the loc for given PE.
> eeh_pe_loc_get_bus(struct pci_bus *bus) which retrieves the location
> code for given bus.
>
> This resolves lockdep warnings such as:
> <snip>
> [ 84.964298] [ T928] ============================================
> [ 84.964304] [ T928] WARNING: possible recursive locking detected
> [ 84.964311] [ T928] 6.18.0-rc3 #51 Not tainted
> [ 84.964315] [ T928] --------------------------------------------
> [ 84.964320] [ T928] eehd/928 is trying to acquire lock:
> [ 84.964324] [ T928] c000000003b29d58
> (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40
> [ 84.964342] [ T928]
> but task is already holding lock:
> [ 84.964347] [ T928] c000000003b29d58
> (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40
> [ 84.964357] [ T928]
> other info that might help us debug this:
> [ 84.964363] [ T928] Possible unsafe locking scenario:
>
> [ 84.964367] [ T928] CPU0
> [ 84.964370] [ T928] ----
> [ 84.964373] [ T928] lock(pci_rescan_remove_lock);
> [ 84.964378] [ T928] lock(pci_rescan_remove_lock);
> [ 84.964383] [ T928]
> *** DEADLOCK ***
>
> [ 84.964388] [ T928] May be due to missing lock nesting notation
>
> [ 84.964393] [ T928] 1 lock held by eehd/928:
> [ 84.964397] [ T928] #0: c000000003b29d58
> (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_lock_rescan_remove+0x28/0x40
> [ 84.964408] [ T928]
> stack backtrace:
> [ 84.964414] [ T928] CPU: 2 UID: 0 PID: 928 Comm: eehd Not tainted
> 6.18.0-rc3 #51 VOLUNTARY
> [ 84.964417] [ T928] Hardware name: IBM,9080-HEX POWER10 (architected)
> 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_022) hv:phyp pSeries
> [ 84.964419] [ T928] Call Trace:
> [ 84.964420] [ T928] [c0000011a7157990] [c000000001705de4]
> dump_stack_lvl+0xc8/0x130 (unreliable)
> [ 84.964424] [ T928] [c0000011a71579d0] [c0000000002f66e0]
> print_deadlock_bug+0x430/0x440
> [ 84.964428] [ T928] [c0000011a7157a70] [c0000000002fd0c0]
> __lock_acquire+0x1530/0x2d80
> [ 84.964431] [ T928] [c0000011a7157ba0] [c0000000002fea54]
> lock_acquire+0x144/0x410
> [ 84.964433] [ T928] [c0000011a7157cb0] [c0000011a7157cb0]
> __mutex_lock+0xf4/0x1050
> [ 84.964436] [ T928] [c0000011a7157e00] [c000000000de21d8]
> pci_lock_rescan_remove+0x28/0x40
> [ 84.964439] [ T928] [c0000011a7157e20] [c00000000004ed98]
> eeh_pe_bus_get+0x48/0xc0
> [ 84.964442] [ T928] [c0000011a7157e50] [c000000000050434]
> eeh_handle_normal_event+0x64/0xa60
> [ 84.964446] [ T928] [c0000011a7157f30] [c000000000051de8]
> eeh_event_handler+0xf8/0x190
> [ 84.964450] [ T928] [c0000011a7157f90] [c0000000002747ac]
> kthread+0x16c/0x180
> [ 84.964453] [ T928] [c0000011a7157fe0] [c00000000000ded8]
> start_kernel_thread+0x14/0x18
I have no comment on the patch itself, but the timestamps above aren't
relevant and could be removed. Maybe also the "T928" part.
Bjorn