On Fri, May 31, 2019 at 9:19 AM Josh Poimboeuf <jpoim...@redhat.com> wrote:
>
> On Fri, May 31, 2019 at 05:41:18PM +0200, Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> >
> > > The only question I'd have is if we have data on the power savings
> > > difference between hlt and mwait. mwait seems to wake up on a lot of
> > > different conditions which might negate its deeper sleep state.
> >
> > hlt wakes up on basically the same set of events, but has the
> > auto-restarting semantics on some of them (especially SMM). So the wakeup
> > frequency itself shouldn't really contribute to power consumption
> > difference; it's the C-state that mwait allows CPU to enter.
>
> Ok. I reluctantly surrender :-) For your v4:
>
>   Reviewed-by: Josh Poimboeuf <jpoim...@redhat.com>
>
> It works as a short term fix, but it's fragile, and it does feel like
> we're just adding more duct tape, as Andy said.
>

Just to clarify what I was thinking, it seems like soft-offlining a CPU
and resuming a kernel have fundamentally different requirements. To
soft-offline a CPU, we want to get power consumption as low as possible
and make sure that MCE won't kill the system. It's okay for the CPU to
occasionally execute some code.

For resume, what we're really doing is trying to hand control of all
CPUs from kernel A to kernel B. There are two basic ways to hand off
control of a given CPU: we can jump (with JMP, RET, horrible
self-modifying code, etc) from one kernel to the other, or we can
attempt to make a given CPU stop executing code from either kernel at
all and then forcibly wrench control of it in kernel B. Either approach
seems okay, but the latter approach depends on getting the CPU to
reliably stop executing code. We don't care about power consumption for
resume, and I'm not even convinced that we need to be able to survive an
MCE that happens while we're resuming, although surviving MCE would be
nice.

So if we don't want to depend on nasty system details at all, we could
have the first kernel explicitly wake up all CPUs and hand them all off
to the new kernel, more or less the same way that we hand over control
of the BSP right now. Or we can look for a way to tell all the APs to
stop executing kernel code, and the only architectural way I know of to
do that is to send an INIT IPI (and then presumably deassert INIT -- the
SDM is a bit vague).

Or we could allocate a page, stick a GDT, a TSS, and a "1: hlt; jmp 1b"
in it, turn off paging, and run that code. And then somehow convince the
kernel we load not to touch that page until it finishes waking up all
CPUs. This seems conceptually simple and very robust, but I'm not sure
it fits in with the way hibernation works right now at all.
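
To make the last option a bit more concrete, here is a rough user-space
sketch of just the "1: hlt; jmp 1b" part: it writes the three instruction
bytes of that loop into a buffer standing in for the page. The names
build_park_loop() and park_loop_jump_target() are made up for
illustration and are not existing kernel functions. The sketch leaves out
the GDT, the TSS, turning off paging, and the question of how kernel B
would be told to keep its hands off the page, so it is nowhere near a
working implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x86 encodings: HLT is F4; JMP rel8 is EB plus a signed 8-bit offset. */
enum { OP_HLT = 0xF4, OP_JMP_REL8 = 0xEB };

/* Length in bytes of "1: hlt; jmp 1b" (1 byte hlt + 2 byte short jmp). */
#define PARK_LOOP_LEN 3

/*
 * Hypothetical helper: emit the parking loop at the start of 'page'.
 * Returns the number of bytes written.
 */
size_t build_park_loop(uint8_t *page, size_t page_size)
{
    assert(page_size >= PARK_LOOP_LEN);

    page[0] = OP_HLT;       /* 1: hlt */
    page[1] = OP_JMP_REL8;  /*    jmp 1b */
    /*
     * rel8 is relative to the end of the jmp (offset 3), so reaching
     * offset 0 needs a displacement of -3, i.e. 0xFD.
     */
    page[2] = (uint8_t)(int8_t)-PARK_LOOP_LEN;

    return PARK_LOOP_LEN;
}

/*
 * Hypothetical helper: decode where the emitted jmp lands, as an offset
 * from the start of the page. For a correct loop this is 0, the hlt.
 */
long park_loop_jump_target(const uint8_t *page)
{
    return PARK_LOOP_LEN + (int8_t)page[2];
}
```

The point of the loop is that any event that wakes the CPU out of hlt
(SMM, NMI, etc) just falls through to the jmp and goes straight back to
hlt, so the CPU never executes anything outside those three bytes.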