On 3/15/19 5:50 AM, Peter Zijlstra wrote:
> On Mon, Mar 11, 2019 at 04:48:44PM +0000, Lendacky, Thomas wrote:
>> On AMD processors, the detection of an overflowed counter in the NMI
>> handler relies on the current value of the counter. So, for example, to
>> check for overflow on a 48-bit counter, bit 47 is checked to see if it
>> is 1 (not overflowed) or 0 (overflowed).
>>
>> There is currently a race condition present when disabling and then
>> updating the PMC. Increased NMI latency in newer AMD processors makes this
>> race condition more pronounced.
> 
> Increased NMI latency also makes the results less useful :/ What amount
> of skid are we talking about, and is there anything AMD is going to do
> about this?

I haven't looked into the amount of skid, but, yes, the hardware team is
looking at this.

> 
>> If the counter value has overflowed, it is
>> possible to update the PMC value before the NMI handler can run.
> 
> Arguably the WRMSR should sync against the PMI. That is the behaviour
> one would expect.
> 
> Isn't that something you can fix in ucode? And could you please tell
> the hardware people this is disgusting?

Currently, there's nothing they've found that can be done in ucode for
this. But, yes, they know it's a problem and they're looking at what they
can do.

> 
>> The updated PMC value is not an overflowed value, so when the perf NMI
>> handler does run, it will not find an overflowed counter. This may
>> appear as an unknown NMI resulting in either a panic or a series of
>> messages, depending on how the kernel is configured.
>>
>> To eliminate this race condition, the PMC value must be checked after
>> disabling the counter in x86_pmu_disable_all(), and, if overflowed, must
>> wait for the NMI handler to reset the value before continuing. Add a new,
>> optional, callable function that can be used to test for and resolve this
>> condition.
>>
>> Cc: <[email protected]> # 4.14.x-
>> Signed-off-by: Tom Lendacky <[email protected]>
> 
>> +static void amd_pmu_wait_on_overflow(int idx, u64 config)
>> +{
>> +    unsigned int i;
>> +    u64 counter;
>> +
>> +    /*
>> +     * We shouldn't be calling this from NMI context, but add a safeguard
>> +     * here to return, since if we're in NMI context we can't wait for an
>> +     * NMI to reset an overflowed counter value.
>> +     */
>> +    if (in_nmi())
>> +            return;
>> +
>> +    /*
>> +     * If the interrupt isn't enabled then we won't get the NMI that will
>> +     * reset the overflow condition, so return.
>> +     */
>> +    if (!(config & ARCH_PERFMON_EVENTSEL_INT))
>> +            return;
>> +
>> +    /*
>> +     * Wait for the counter to be reset if it has overflowed. This loop
>> +     * should exit very, very quickly, but just in case, don't wait
>> +     * forever...
>> +     */
>> +    for (i = 0; i < OVERFLOW_WAIT_COUNT; i++) {
>> +            rdmsrl(x86_pmu_event_addr(idx), counter);
>> +            if (counter & (1ULL << (x86_pmu.cntval_bits - 1)))
>> +                    break;
>> +
>> +            /* Might be in IRQ context, so can't sleep */
>> +            udelay(1);
>> +    }
>> +}
> 
> Argh.. that's horrible, as I'm sure you fully appreciate :/

Yeah...

> 
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index b684f0294f35..f1d2f70000cd 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -606,6 +606,9 @@ void x86_pmu_disable_all(void)
>>                      continue;
>>              val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
>>              wrmsrl(x86_pmu_config_addr(idx), val);
>> +
>> +            if (x86_pmu.wait_on_overflow)
>> +                    x86_pmu.wait_on_overflow(idx, val);
>>      }
>>   }
> 
> One alternative is adding amd_pmu_disable_all() to amd/core.c and using
> that. Then you can also change the loop to do the wait after all the
> WRMSRs, if that helps with latency.

I thought about that for both this and the next patch, but since it would
have meant duplicating all the code, I went with the added callbacks. It
might be worth it for this patch to have an AMD disable_all callback,
since it's not a lot of code to duplicate compared to handle_irq, and I
like the idea of doing the overflow check after all the WRMSRs.

Thanks for the review, Peter.

Tom
