> On Jan 14, 2019, at 3:27 PM, Andy Lutomirski <l...@kernel.org> wrote:
>
> On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin <h...@zytor.com> wrote:
>> So I was already in the middle of composing this message when Andy posted:
>>
>>> I don't even think this is sufficient. I think we also need everyone
>>> who clears the bit to check if all bits are clear and, if so, remove
>>> the breakpoint. Otherwise we have a situation where, if you are in
>>> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
>>> and that interrupt then hits the breakpoint, then you deadlock because
>>> no one removes the breakpoint.
>>>
>>> If we do this, and if we can guarantee that all CPUs make forward
>>> progress, then maybe the problem is solved. Can we guarantee something
>>> like all NMI handlers that might wait in a spinlock or for any other
>>> reason will periodically check if a sync is needed while they're
>>> spinning?
>>
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP. We can disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to deal
>> with #MC and what not.)
>>
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the patching
>> from the #BP handler on that processor.
>>
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>>> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>>>> static cpumask_t text_poke_cpumask;
>>>> static void text_poke_sync_cpu(void *dummy);
>>>>
>>>> static void text_poke_sync(void)
>>>> {
>>>> 	smp_wmb();
>>>> 	cpumask_copy(&text_poke_cpumask, cpu_online_mask);
>>>> 	smp_wmb();	/* Should be optional on x86 */
>>>> 	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
>>>> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>>>> 	while (!cpumask_empty(&text_poke_cpumask)) {
>>>> 		cpu_relax();
>>>> 		smp_rmb();
>>>> 	}
>>>> }
>>>>
>>>> static void text_poke_sync_cpu(void *dummy)
>>>> {
>>>> 	(void)dummy;
>>>>
>>>> 	smp_rmb();
>>>> 	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
>>>> 	/*
>>>> 	 * We are guaranteed to return with an IRET, either from the
>>>> 	 * IPI or the #BP handler; this provides serialization.
>>>> 	 */
>>>> }
>>>
>>> The invariants here are:
>>>
>>> 1. The patching routine must set each bit in the cpumask after each event
>>>    that requires synchronization is complete.
>>> 2. The bit can be (atomically) cleared on the target CPU only, and only in
>>>    a place that guarantees a synchronizing event (e.g. IRET) before it may
>>>    reach the poked instruction.
>>> 3. At a minimum the IPI handler and #BP handler need to clear the bit. It
>>>    *is* also possible to clear it in other places, e.g. the NMI handler, if
>>>    necessary, as long as condition 2 is satisfied.
>>
>> OK, so with interrupts enabled *on the processor doing the patching* we
>> still have a problem if it takes an interrupt which in turn takes a #BP.
>> Disabling interrupts would not help, because an NMI or #MC could still
>> cause problems unless we can guarantee that no path which may be invoked
>> by NMI/#MC can do text_poke, which seems to be a very aggressive
>> assumption.
>>
>> Note: I am assuming preemption is disabled.
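If I read these invariants correctly, the #BP side of the scheme boils down
to something like the sketch below. This is only my reading, not code anyone
has posted: the text_poke_in_progress flag and the fixup of regs->ip past the
poked instruction are hand-waved.

	static bool text_poke_in_progress;	/* assumed: set/cleared by the patcher */

	int poke_int3_handler(struct pt_regs *regs)
	{
		if (!READ_ONCE(text_poke_in_progress))
			return 0;	/* not our breakpoint */

		/* Invariants 2+3: clear our bit only on a path that ends in IRET */
		smp_rmb();
		cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);

		/* ...step regs->ip over (or around) the poked instruction... */

		return 1;	/* the return path ends in IRET, which serializes */
	}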
>>
>> The easiest/sanest way to deal with this might be to switch the IDT (or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either roll
>> forward or revert. This is doable because the second sync we currently
>> do is not actually necessary per the hardware guys.
>
> This is IMO insanely complicated. I much prefer the kind of
> complexity that is more or less deterministic and easy to test to the
> kind of complexity (like this) that only happens in corner cases.
>
> I see two solutions here:
>
> 1. Just suck it up and emulate the CALL. And find a way to write a
> test case so we know it works.
>
> 2. Find a non-deadlocky way to make the breakpoint handler wait for
> the breakpoint to get removed, without any mucking at all with the
> entry code. And find a way to write a test case so we know it works.
> (E.g. stick an actual static_call call site *in text_poke_bp()* that
> fires once on boot so that the really awful recursive case gets
> exercised all the time.)
>
> But if we're going to do any mucking with the entry code, let's just
> do the simple mucking to make emulating CALL work.
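Just to spell out for myself what option 1 means: the #BP handler has to
emulate the 5-byte CALL that the INT3 replaced, roughly as in the sketch
below (the name is mine and nothing here is tested). The part that needs the
entry-code mucking, as I understand it, is the push: for a breakpoint that
hits kernel code the exception frame is built immediately below the
interrupted regs->sp, so writing the return address there scribbles on that
frame unless the entry code first makes room for it.

	/*
	 * Sketch: emulate the 5-byte CALL that was overwritten with INT3.
	 * regs->ip points just past the 1-byte INT3, so the return address
	 * of the original CALL is another 4 bytes further on.
	 */
	static void emulate_call_from_bp(struct pt_regs *regs, unsigned long target)
	{
		unsigned long ret_addr = regs->ip + 4;

		/* Emulate the PUSH of the return address... */
		regs->sp -= sizeof(unsigned long);
		*(unsigned long *)regs->sp = ret_addr;

		/* ...and the transfer of control to the call target. */
		regs->ip = target;
	}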
These two approaches still seem more complicated to me than having a
trampoline as backup, which is patched dynamically. IIUC, the current
pushback against this option is that it makes batching more complicated.
I am not sure it is that bad, but there are other variants of this
solution, for example using a retpoline-like flow:

BP-handler:
	/* Set a per-CPU variable to hold the call target */
	this_cpu_write(interrupted_static_call_target,
		       get_static_call_targets(regs->ip));

	/* Choose the trampoline based on whether IRQs were enabled when the #BP hit */
	if (regs->flags & X86_EFLAGS_IF) {
		regs->flags &= ~X86_EFLAGS_IF;
		regs->ip = user_handler_IRQ_disabled;
	} else {
		regs->ip = user_handler_IRQ_enabled;
	}

user_handler_IRQ_disabled:
	push PER_CPU_VAR(interrupted_static_call_target)
	sti	# this one is not needed in the IRQs-enabled variant
	ret	# STI-blocking prevents preemption before the RET

Obviously, I don’t know how this coexists with CET.
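To make the C half of that less hand-wavy, the handler would be something
along these lines (untested; get_static_call_targets() and the two trampoline
symbols are still assumptions, as above):

	DEFINE_PER_CPU(unsigned long, interrupted_static_call_target);

	/* The two assembly trampolines from the pseudo-code above */
	extern void user_handler_IRQ_disabled(void);
	extern void user_handler_IRQ_enabled(void);

	static void static_call_bp_handler(struct pt_regs *regs)
	{
		/* Remember where the interrupted call site was supposed to go */
		this_cpu_write(interrupted_static_call_target,
			       get_static_call_targets(regs->ip));

		if (regs->flags & X86_EFLAGS_IF) {
			/* Enter the trampoline with IRQs off; it does STI;RET itself */
			regs->flags &= ~X86_EFLAGS_IF;
			regs->ip = (unsigned long)user_handler_IRQ_disabled;
		} else {
			regs->ip = (unsigned long)user_handler_IRQ_enabled;
		}
	}

As I understand the scheme, IRQs have to stay off until the trampoline so
that nothing overwrites the per-CPU target between the #BP handler writing it
and the push consuming it; the STI shadow then covers the final ret.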