Bug#800574: backport to sid/stable? (was RE: libc6: lock elision hazard on Intel Broadwell and Skylake)

Carlos Alberto Lopez Perez Mon, 26 Oct 2015 12:15:49 -0700

On 23/10/15 22:10, Henrique de Moraes Holschuh wrote:
> On Fri, Oct 23, 2015, at 11:13, Carlos Alberto Lopez Perez wrote:
>> I was having trouble (crashes with the NVIDIA proprietary driver) on a
>> Debian system with an i7-5775C and libc6=2.19-18+deb8u1 (stable)
> 
> This is very very likely to be braindamage on the NVIDIA driver, though.
> 
> Are you sure that driver is not doing something as idiotic as unlocking
> an already unlocked mutex ?
> 
> The proper fix in that case is _always_ to fix whatever is broken,
> because eventually it will run on something that has working hardware
> lock elision... and crash.
>


I can't tell: I don't have access to the driver's source code, and no
debug symbols are available either, so every attempt to get a
meaningful backtrace was hopeless.

At first I also thought the driver was doing something wrong, but
then I found several reports from people with the same cryptic backtrace
as mine, saying that it was caused by the TSX-NI bug in recent Intel
CPUs [1].

And indeed, after upgrading glibc to the version that disables TSX-NI
on Broadwell, everything suddenly works as expected...

So this seems to suggest that TSX-NI really is buggy on this CPU.

In any case... Do you know of any program or test that I can run to
check whether TSX-NI (both HLE and RTM) is working as it should or is
still buggy on this CPU? That way we could better verify whether the
problem is in the CPU or in the driver.
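
As a first step, a rough sketch like the following could at least show
whether the kernel still reports HLE and RTM, and which microcode
revision is loaded. It assumes a Linux system where /proc/cpuinfo lists
the "hle" and "rtm" flags and a "microcode" field; the function names
are made up for illustration. It only reads what is advertised, so it
cannot tell whether the instructions actually behave correctly.

```python
def parse_cpuinfo(text):
    """Return (flags, microcode) from the first processor entry in text."""
    flags, microcode = None, None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines carry no field
        key = key.strip()
        if key == "flags" and flags is None:
            flags = set(value.split())
        elif key == "microcode" and microcode is None:
            microcode = value.strip()
    return flags or set(), microcode


def tsx_report(text):
    """Summarise advertised TSX features and the microcode revision."""
    flags, microcode = parse_cpuinfo(text)
    return {
        "hle": "hle" in flags,
        "rtm": "rtm" in flags,
        "microcode": microcode,
    }


if __name__ == "__main__":
    # Assumption: Linux-style /proc/cpuinfo is available.
    try:
        with open("/proc/cpuinfo") as f:
            print(tsx_report(f.read()))
    except OSError as exc:
        print("could not read /proc/cpuinfo:", exc)
```

Running it before and after a microcode or glibc change would show
whether the advertised flags changed, which is separate from whether
the crashes went away.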

>> I first tried updating the Intel microcode to the "unreleased" 0x13
>> version, but it neither disabled the TSX-NI instructions [1] nor
>> stopped the crashes.
> 
> Mobile Broadwell-H seems to disable TSX, while Desktop Broadwell-H
> doesn't.  That's why we blacklisted the whole thing: inconsistent
> behavior on the same microcode, and that behavior is itself inconsistent
> with the errata sheet that says such processors shouldn't even be able
> to advertise Intel TSX RTM in CPUID.
> 
> At the moment, we don't even know what is wrong with RTM in
> Broadwell/Broadwell-H/Broadwell-DE.  We do know some of what is wrong
> with HLE in Broadwell/-H/-DE (and it is really nasty), but HLE is not
> used by glibc in the first place, and the HLE erratum is supposedly
> worked around somehow (because it is documented to be so on the Xeon
> D-1500/Broadwell-DE) by the batch of microcode updates available in the
> kernel bugzilla bug report mentioned in this bug report.
> 
> Broadwell-H Microcode 0x13 is useful anyway because it fixes other
> critical errata that hang/oops the kernel: your box should be a _lot_
> more stable with it.  And at least one person reported that not all
> hangs were fixed by microcode 0x12, thus you probably should keep
> using microcode 0x13 (or newer, should one become available).
> 

Good to know, thanks for the advice. I will keep using this 0x13
"unofficial" microcode until a new one is out.
I can't help wondering why Intel is not releasing it :\


Regards!
--------

[1]
https://lists.archlinux.org/pipermail/arch-general/2015-April/038953.html

