Michael Ellerman <m...@ellerman.id.au> writes:
> Greg Kurz <gr...@kaod.org> writes:
>> On Tue, 04 Aug 2020 23:35:10 +1000
>> Michael Ellerman <m...@ellerman.id.au> wrote:
>>> There is a bit of history to this code, but not in a good way :)
>>>
>>> Michael Roth <mdr...@linux.vnet.ibm.com> writes:
>>> > For a power9 KVM guest with XIVE enabled, running a test loop
>>> > where we hotplug 384 vcpus and then unplug them, the following traces
>>> > can be seen (generally within a few loops) either from the unplugged
>>> > vcpu:
>>> >
>>> > [ 1767.353447] cpu 65 (hwid 65) Ready to die...
>>> > [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
>>> > [ 1767.952311] list_del corruption. next->prev should be
>>> > c00a000002470208, but was c00a000002470048
>>> ...
>>> >
>>> > At that point the worker thread assumes the unplugged CPU is in some
>>> > unknown/dead state and procedes with the cleanup, causing the race with
>>> > the XIVE cleanup code executed by the unplugged CPU.
>>> >
>>> > Fix this by inserting an msleep() after each RTAS call to avoid
>>>
>>> We previously had an msleep(), but it was removed:
>>>
>>> b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
>>
>> Ah, I hadn't seen that one...
>>
>>> > pseries_cpu_die() returning prematurely, and double the number of
>>> > attempts so we wait at least a total of 5 seconds. While this isn't an
>>> > ideal solution, it is similar to how we dealt with a similar issue for
>>> > cede_offline mode in the past (940ce422a3).
>>>
>>> Thiago tried to fix this previously but there was a bit of discussion
>>> that didn't quite resolve:
>>>
>>> https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/
>>
>> Yeah it appears that the motivation at the time was to make the "Querying
>> DEAD?" messages to disappear and to avoid potentially concurrent calls to
>> rtas-stop-self which is prohibited by PAPR... not fixing actual crashes.
>
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.

Yes, pHyp's RTAS now tolerates concurrent calls to stop-self. The original
bug that was reported when I worked on this ended in an RTAS crash because
of this timeout. The crash was fixed then.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center