> On Jan 14, 2019, at 3:27 PM, Andy Lutomirski <l...@kernel.org> wrote:
>
> On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin <h...@zytor.com> wrote:
>> So I was already in the middle of composing this message when Andy posted:
>>
>>> I don't even think this is sufficient. I think we also need everyone
>>> who clears the bit to check if all bits are clear and, if so, remove
>>> the breakpoint. Otherwise we have a situation where, if you are in
>>> text_poke_bp() and you take an NMI (or interrupt or MCE or whatever)
>>> and that interrupt then hits the breakpoint, then you deadlock because
>>> no one removes the breakpoint.
>>>
>>> If we do this, and if we can guarantee that all CPUs make forward
>>> progress, then maybe the problem is solved. Can we guarantee something
>>> like all NMI handlers that might wait in a spinlock or for any other
>>> reason will periodically check if a sync is needed while they're
>>> spinning?
>>
>> So the really, really nasty case is when an asynchronous event on the
>> *patching* processor gets stuck spinning on a resource which is
>> unavailable due to another processor spinning on the #BP. We can disable
>> interrupts, but we can't stop NMIs from coming in (although we could
>> test in the NMI handler if we are in that condition and return
>> immediately; I'm not sure we want to do that, and we still have to deal
>> with #MC and what not.)
>>
>> The fundamental problem here is that we don't see the #BP on the
>> patching processor, in which case we could simply complete the patching
>> from the #BP handler on that processor.
>>
>> On 1/13/19 6:40 PM, H. Peter Anvin wrote:
>>> On 1/13/19 6:31 PM, H. Peter Anvin wrote:
>>>> static cpumask_t text_poke_cpumask;
>>>> static void text_poke_sync_cpu(void *dummy);
>>>>
>>>> static void text_poke_sync(void)
>>>> {
>>>> 	smp_wmb();
>>>> 	cpumask_copy(&text_poke_cpumask, cpu_online_mask);
>>>> 	smp_wmb();	/* Should be optional on x86 */
>>>> 	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
>>>> 	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
>>>> 	while (!cpumask_empty(&text_poke_cpumask)) {
>>>> 		cpu_relax();
>>>> 		smp_rmb();
>>>> 	}
>>>> }
>>>>
>>>> static void text_poke_sync_cpu(void *dummy)
>>>> {
>>>> 	(void)dummy;
>>>>
>>>> 	smp_rmb();
>>>> 	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
>>>> 	/*
>>>> 	 * We are guaranteed to return with an IRET, either from the
>>>> 	 * IPI or the #BP handler; this provides serialization.
>>>> 	 */
>>>> }
>>>
>>> The invariants here are:
>>>
>>> 1. The patching routine must set each bit in the cpumask after each event
>>>    that requires synchronization is complete.
>>> 2. The bit can be (atomically) cleared on the target CPU only, and only in
>>>    a place that guarantees a synchronizing event (e.g. IRET) before it may
>>>    reach the poked instruction.
>>> 3. At a minimum the IPI handler and #BP handler need to clear the bit. It
>>>    *is* also possible to clear it in other places, e.g. the NMI handler, if
>>>    necessary, as long as condition 2 is satisfied.
>>
>> OK, so with interrupts enabled *on the processor doing the patching* we
>> still have a problem if it takes an interrupt which in turn takes a #BP.
>> Disabling interrupts would not help, because an NMI or #MC could still
>> cause problems unless we can guarantee that no path which may be invoked
>> by NMI/#MC can do text_poke, which seems to be a very aggressive
>> assumption.
>>
>> Note: I am assuming preemption is disabled.
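If I read these invariants correctly, the #BP side of the scheme boils down
to something like the sketch below. This is only my reading, not code anyone
has posted: the text_poke_in_progress flag and the fixup of regs->ip past the
poked instruction are hand-waved.

	static bool text_poke_in_progress;	/* assumed: set/cleared by the patcher */

	int poke_int3_handler(struct pt_regs *regs)
	{
		if (!READ_ONCE(text_poke_in_progress))
			return 0;	/* not our breakpoint */

		/* Invariants 2+3: clear our bit only on a path that ends in IRET */
		smp_rmb();
		cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);

		/* ...step regs->ip over (or around) the poked instruction... */

		return 1;	/* the return path ends in IRET, which serializes */
	}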
>>
>> The easiest/sanest way to deal with this might be to switch the IDT (or
>> provide a hook in the generic exception entry code) on the patching
>> processor, such that if an asynchronous event comes in, we either roll
>> forward or revert. This is doable because the second sync we currently
>> do is not actually necessary per the hardware guys.
>
> This is IMO insanely complicated. I much prefer the kind of
> complexity that is more or less deterministic and easy to test to the
> kind of complexity (like this) that only happens in corner cases.
>
> I see two solutions here:
>
> 1. Just suck it up and emulate the CALL. And find a way to write a
> test case so we know it works.
>
> 2. Find a non-deadlocky way to make the breakpoint handler wait for
> the breakpoint to get removed, without any mucking at all with the
> entry code. And find a way to write a test case so we know it works.
> (E.g. stick an actual static_call call site *in text_poke_bp()* that
> fires once on boot so that the really awful recursive case gets
> exercised all the time.)
>
> But if we're going to do any mucking with the entry code, let's just
> do the simple mucking to make emulating CALL work.
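Just to spell out for myself what option 1 means: the #BP handler has to
emulate the 5-byte CALL that the INT3 replaced, roughly as in the sketch
below (the name is mine and nothing here is tested). The part that needs the
entry-code mucking, as I understand it, is the push: for a breakpoint that
hits kernel code the exception frame is built immediately below the
interrupted regs->sp, so writing the return address there scribbles on that
frame unless the entry code first makes room for it.

	/*
	 * Sketch: emulate the 5-byte CALL that was overwritten with INT3.
	 * regs->ip points just past the 1-byte INT3, so the return address
	 * of the original CALL is another 4 bytes further on.
	 */
	static void emulate_call_from_bp(struct pt_regs *regs, unsigned long target)
	{
		unsigned long ret_addr = regs->ip + 4;

		/* Emulate the PUSH of the return address... */
		regs->sp -= sizeof(unsigned long);
		*(unsigned long *)regs->sp = ret_addr;

		/* ...and the transfer of control to the call target. */
		regs->ip = target;
	}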
These two approaches still seem more complicated to me than having a
trampoline as backup, which is patched dynamically. IIUC, the current
pushback against this option is that it makes batching more complicated.
I am not sure it is that bad, but there are other variants of this
solution, for example using a retpoline-like flow:

BP-handler:
	/* Set a per-CPU variable to hold the call target */
	this_cpu_write(interrupted_static_call_target,
		       get_static_call_targets(regs->ip));

	/* Choose the trampoline based on whether IRQs were enabled when the #BP hit */
	if (regs->flags & X86_EFLAGS_IF) {
		regs->flags &= ~X86_EFLAGS_IF;
		regs->ip = user_handler_IRQ_disabled;
	} else {
		regs->ip = user_handler_IRQ_enabled;
	}

user_handler_IRQ_disabled:
	push PER_CPU_VAR(interrupted_static_call_target)
	sti	# this one is not needed in the IRQs-enabled variant
	ret	# STI-blocking prevents preemption before the RET

Obviously, I don’t know how this coexists with CET.
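To make the C half of that less hand-wavy, the handler would be something
along these lines (untested; get_static_call_targets() and the two trampoline
symbols are still assumptions, as above):

	DEFINE_PER_CPU(unsigned long, interrupted_static_call_target);

	/* The two assembly trampolines from the pseudo-code above */
	extern void user_handler_IRQ_disabled(void);
	extern void user_handler_IRQ_enabled(void);

	static void static_call_bp_handler(struct pt_regs *regs)
	{
		/* Remember where the interrupted call site was supposed to go */
		this_cpu_write(interrupted_static_call_target,
			       get_static_call_targets(regs->ip));

		if (regs->flags & X86_EFLAGS_IF) {
			/* Enter the trampoline with IRQs off; it does STI;RET itself */
			regs->flags &= ~X86_EFLAGS_IF;
			regs->ip = (unsigned long)user_handler_IRQ_disabled;
		} else {
			regs->ip = (unsigned long)user_handler_IRQ_enabled;
		}
	}

As I understand the scheme, IRQs have to stay off until the trampoline so
that nothing overwrites the per-CPU target between the #BP handler writing it
and the push consuming it; the STI shadow then covers the final ret.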