On Mon, Oct 17, 2016 at 5:30 PM, Rafael J. Wysocki <r...@rjwysocki.net> wrote:
> On Sunday, October 16, 2016 09:50:23 AM Andy Lutomirski wrote:
>> On Sat, Oct 8, 2016 at 3:31 AM, Rafael J. Wysocki <raf...@kernel.org> wrote:
>> > On Fri, Oct 7, 2016 at 9:47 PM, Andy Lutomirski <l...@kernel.org> wrote:
>> >> On 06/25/2016 09:19 AM, Chen Yu wrote:
>> >>>
>> >>> Here's the story of what the problem is, why this
>> >>> happened, and why this patch looks like this:
>> >>>
>> >>> Stress test from Varun Koyyalagunta reports that, the
>> >>> nonboot CPU would hang occasionally, when resuming from
>> >>> hibernation. Further investigation shows that, the precise
>> >>> stage when nonboot CPU hangs, is the time when the nonboot
>> >>> CPU has been woken up incorrectly, and tries to monitor the
>> >>> mwait_ptr for the second time, then an exception is
>> >>> triggered due to illegal vaddr access, say, something like,
>> >>> 'Unable to handle kernel address of 0xffff8800ba800010...'
>> >>>
>> >>> Further investigation shows that, the exception is caused
>> >>> by accessing a page without PRESENT flag, because the pte entry
>> >>> for this vaddr is zero. Here's the scenario how this problem
>> >>> happens: Page table for direct mapping is allocated dynamically
>> >>> by kernel_physical_mapping_init, it is possible that in the
>> >>> resume process, when the boot CPU is trying to write back pages
>> >>> to their original address, and happens to write to the monitored
>> >>> mwait_ptr, which wakes up one of the nonboot CPUs, since the page
>> >>> table currently used by the nonboot CPU might not be the same as it
>> >>> was before the hibernation, an exception might occur due to
>> >>> inconsistent page table.
>> >>>
>> >>> First try is to get rid of this problem by changing the monitor
>> >>> address from task.flag to zero page, because no one would write
>> >>> to zero page.
>> >>> But this still has a problem because of the ping-pong
>> >>> wake up situation in mwait_play_dead:
>> >>>
>> >>> One possible implementation of a clflush is a read-invalidate snoop,
>> >>> which is what a store might look like, so clflush might break the mwait.
>> >>>
>> >>> 1. CPU1 waits at zero page
>> >>> 2. CPU2 clflushes zero page, wakes CPU1 up, then CPU2 waits at zero page
>> >>> 3. CPU1 is woken up, and invokes clflush on zero page, thus waking up
>> >>> CPU2 again.
>> >>> then the nonboot CPUs never sleep for long.
>> >>>
>> >>> So it's better to monitor a different address for each
>> >>> nonboot CPU, however since there is only one zero page, at most:
>> >>> PAGE_SIZE/L1_CACHE_LINE CPUs are satisfied, which is usually 64
>> >>> on x86_64, apparently it's not enough for servers, maybe more
>> >>> zero pages are required.
>> >>>
>> >>> So choose the solution as Brian suggested, to put the nonboot CPUs
>> >>> into hlt before resuming. But Rafael has mentioned that, if some of
>> >>> the CPUs have already been offline before hibernation, then the problem
>> >>> is still there. So this patch tries to wake up the already offline CPUs
>> >>> so that they fall into hlt, and then puts the remaining online CPUs
>> >>> into hlt.
>> >>> In this way, all the nonboot CPUs will wait at a safe state,
>> >>> without touching any memory during s/r. (It's not safe to modify
>> >>> mwait_play_dead, because once previous offline CPUs are woken up,
>> >>> it will either access text code, whose page table is not safe anymore
>> >>> across hibernation, due to:
>> >>> Commit ab76f7b4ab23 ("x86/mm: Set NX on gap between __ex_table and
>> >>> rodata").
>> >>>
>> >>
>> >> I realize I'm extremely late to the party, but I must admit that I don't
>> >> get it. Sure, hibernation resume can spuriously wake the non-boot CPU,
>> >> but at some point it has to wake up for real.
>> >
>> > You mean during resume? We reinit from scratch then.
>> >
>> >> What ensures that the text it was
>> >> running (native_play_dead or whatever) is still there when it wakes up?
>> >>
>> >> Or does the hibernation resume code actually send the remote CPU an
>> >> INIT-SIPI sequence a la wakeup_secondary_cpu_via_init()?
>> >
>> > That's what happens AFAICS.
>> >
>> >> If so, this seems
>> >> a bit odd to me. Shouldn't we kick the CPU all the way to the
>> >> wait-for-SIPI state rather than getting it to play dead via hlt or
>> >> mwait?
>> >
>> > We could do that. It would be a bit cleaner than using the "hlt play
>> > dead" thing, but the practical difference would be very small (if
>> > observable at all).
>>
>> Probably true. It might be worth changing the "hlt" path to something like:
>>
>> asm volatile ("hlt");
>> WARN(1, "CPU woke directly from halt-for-resume -- should have been woken by SIPI\n");
>
> The visibility of that warning would be sort of limited, though, because the
> only case it might show up is when the system went belly up due to an
> unhandled page fault.

Righto. Maybe at least add a comment that the hlt isn't intended to
ever resume on the next instruction?

--Andy