On Mon, Oct 20, 2025 at 11:20:58PM +0800, Shuai Xue wrote: > 2025/10/20 22:24, Lukas Wunner: > > On Mon, Oct 20, 2025 at 10:17:10PM +0800, Shuai Xue wrote: > > > > > .slot_reset() > > > > > => pci_restore_state() > > > > > => pci_aer_clear_status() > > > > > > > > This was added in 2015 by b07461a8e45b. The commit claims that > > > > the errors are stale and can be ignored. It turns out they cannot. > > > > > > > > So maybe pci_restore_state() should print information about the > > > > errors before clearing them? > > > > > > While that could work, we would lose the error severity information at > > > > Wait, we've got that saved in pci_cap_saved_state, so we could restore > > the severity register, report leftover errors, then clear those errors? > > You're right that the severity register is also sticky, so we could > retrieve error severity directly from AER registers. > > However, I have concerns about implementing this approach: [...] > 3. Architectural consistency: As you noted earlier, "pci_restore_state() > is only supposed to restore state, as the name implies, and not clear > errors." Adding error reporting to this function would further violate > this principle - we'd be making it do even more than just restore state. > > Would you prefer I implement this broader change, or shall we proceed > with the targeted helper function approach for now? The helper function > solves the immediate problem while keeping the changes focused on the > AER recovery path.
My opinion is that b07461a8e45b was wrong and that reported errors should not be silently ignored. What I'd prefer is that if pci_restore_state() discovers unreported errors, it asks the AER driver to report them. We've already got a helper to do that: aer_recover_queue() It queues up an entry in AER's kfifo and asks AER to report it. So far the function is only used by GHES. GHES allocates the aer_regs argument from ghes_estatus_pool using gen_pool_alloc(). Consequently aer_recover_work_func() uses ghes_estatus_pool_region_free() to free the allocation. That prevents using aer_recover_queue() for anything else than GHES. It would first be necessary to refactor aer_recover_queue() + aer_recover_work_func() such that it can cope with arbitrary allocations (e.g. kmalloc()). Thanks, Lukas
