Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)

2015-12-09 Thread Aurelien Jarno
On 2015-12-08 22:30, Jelle Haandrikman wrote:
> Hi Andreas,
> 
> On 2015-12-08 19:25, Andreas Beckmann wrote:
> >Hi Aurelien,
> >
> >... buggy software (#807244), which is only exposed by running on
> >hardware with working TSX-NI.
> >That could also explain the fact that the bug was introduced in 352+.
> >
> >Jelle, I didn't dig through the nvidia forums, but if this info isn't
> >mentioned there already, maybe you could post it:
> >
> >>According to the backtrace the problem is typical of a call to
> >>mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
> >>before.
> >(or was already unlocked.)
> I'm not a member of any Nvidia forum. I'm more of an advanced
> Debian user, with a technical background as a tester. All the searches
> that I just did regarding mutex_unlock() and the driver point back to
> this post.
> 
> You really are doing the best analysis I have found. Unfortunately it's
> also the only one I can find.

As is often the case, this can also be found on the Arch Linux bug tracker:

https://bugs.archlinux.org/task/46064?project=1

There is even a link to an ugly patch showing that the issue has been
understood. Finally, according to the last post in this bug entry, it
seems that nvidia is about to release a fix.

Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net



Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)

2015-12-08 Thread Aurelien Jarno
Hi,

On 2015-12-07 23:26, Andreas Beckmann wrote:
> Dear libc maintainers,
> 
> we recently got a bug report regarding the TSX-NI / lock elision bug in
> combination with the non-free nvidia driver (#807244). Since that is
> supposed to be fixed with the libc in experimental (and now sid as
> well), perhaps you could take a look at why this still happens.
> Several forum posts report that "compiling glibc without
> --enable-lock-elision" works around that issue.

I disagree that it is supposed to be fixed. Intel had a few bugs in their
TSX-NI implementation for Haswell, Broadwell and possibly early
versions of Skylake, and to avoid data loss we have therefore disabled
lock elision for some CPU revisions. That said, the bugs in the Intel
implementation are corner cases, and it took quite some time for them to
be discovered. If your program crashes reproducibly, it's definitely not
an issue with the TSX-NI implementation. Disabling --enable-lock-elision
is just a workaround for the real issue. People are now starting to have
CPUs with a working TSX-NI implementation, which is therefore not
blacklisted, and thus the problem is appearing again.

> A few ideas from my side, but since I don't have the hardware to test, I
> cannot check anything:
> * that specific CPU needs to be blacklisted / is incorrectly whitelisted

As said above, that can't be it.

> * nvidia utilizes a code path in libc that is not covered by the current
> patch (and that code path is not used by any other application)
> * nvidia does call something it shouldn't call directly ... thus
> circumventing the runtime-disabling of the specific routines in libc6

According to the backtrace the problem is typical of a call to
mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
before. Nvidia should fix the bug there.

> * nvidia code does issue the problematic instructions itself (but the
> backtrace points to libc, so this sounds unlikely)
> 
> Is there some way to check at runtime how lock elision is handled by
> libc (on a concrete system)?

What do you mean by "how is it handled"? I have attached a small program
which demonstrates the issue. You can use it to check whether your system
is using lock elision or not. Running this program under ltrace, it's quite
easy to spot the call to pthread_mutex_unlock() on an already unlocked
mutex. I wonder if it's doable to do the same with the whole Nvidia stack.

Aurelien

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
/* compile with gcc -o mutex_crash_tsx mutex_crash_tsx.c -lpthread */

#include <pthread.h>

int main()
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
    /* Second unlock of an already unlocked mutex: this is the bug. */
    pthread_mutex_unlock(&m);
    return 0;
}


Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)

2015-12-08 Thread Andreas Beckmann
Hi Aurelien,

thanks for your analysis.

On 2015-12-08 10:23, Aurelien Jarno wrote:
> I disagree it is supposed to be fixed. Intel got a few bugs in their
> TSX-NI implementation for Haswell and Broadwell and possibly early
> versions of Skylake, and to avoid data loss we have therefore disabled
> lock elision for some CPU revisions.

That's what I meant by "fixed". But obviously there are two problems
here: buggy hardware (blacklisted, #800574) and ...

> That said the bugs in the Intel
> implementation are corner cases, and it took quite some time for them to
> get discovered. If your program crashes reproducibly, it's definitely not
> an issue with the TSX-NI implementation. Disabling --enable-lock-elision
> is just a workaround for the real issue. People now start to have CPUs
> with a working TSX-NI implementation which is therefore not blacklisted
> and thus the problem is appearing again.

... buggy software (#807244), which is only exposed by running on
hardware with working TSX-NI.
That could also explain the fact that the bug was introduced in 352+.

Jelle, I didn't dig through the nvidia forums, but if this info isn't
mentioned there already, maybe you could post it:

> According to the backtrace the problem is typical of a call to
> mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
> before.
(or was already unlocked.)


Andreas



Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)

2015-12-08 Thread Jelle Haandrikman

Hi Andreas,

On 2015-12-08 19:25, Andreas Beckmann wrote:
> Hi Aurelien,
>
> ... buggy software (#807244), which is only exposed by running on
> hardware with working TSX-NI.
> That could also explain the fact that the bug was introduced in 352+.
>
> Jelle, I didn't dig through the nvidia forums, but if this info isn't
> mentioned there already, maybe you could post it:
>
>> According to the backtrace the problem is typical of a call to
>> mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
>> before.
> (or was already unlocked.)

I'm not a member of any Nvidia forum. I'm more of an advanced
Debian user, with a technical background as a tester. All the searches
that I just did regarding mutex_unlock() and the driver point back to
this post.

You really are doing the best analysis I have found. Unfortunately it's
also the only one I can find.

Thanks for already doing this investigation.

best regards,
Jelle



Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)

2015-12-07 Thread Andreas Beckmann
Dear libc maintainers,

we recently got a bug report regarding the TSX-NI / lock elision bug in
combination with the non-free nvidia driver (#807244). Since that is
supposed to be fixed with the libc in experimental (and now sid as
well), perhaps you could take a look at why this still happens.
Several forum posts report that "compiling glibc without
--enable-lock-elision" works around that issue.

A few ideas from my side, but since I don't have the hardware to test, I
cannot check anything:
* that specific CPU needs to be blacklisted / is incorrectly whitelisted
* nvidia utilizes a code path in libc that is not covered by the current
patch (and that code path is not used by any other application)
* nvidia does call something it shouldn't call directly ... thus
circumventing the runtime-disabling of the specific routines in libc6
* nvidia code does issue the problematic instructions itself (but the
backtrace points to libc, so this sounds unlikely)

Is there some way to check at runtime how lock elision is handled by
libc (on a concrete system)?

Andreas

On 2015-12-06 17:53, Jelle Haandrikman wrote:
> On a system with an Nvidia GTX 970, Intel Skylake i5-6600k running driver
> 352.63-1 (experimental) several programs crash due to TSX-NI / elision unlock.
> This affects sddm, unlocking kscreen, vlc and deleting files using dolphin.
> 
> Other people also have found this issue.
> http://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/nvidia-linux/825702-nvidia-s-latest-binary-driver-is-causing-problems-for-some-skylake-linux-users
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574 #800574
> https://devtalk.nvidia.com/default/topic/893325/newest-and-beta-linux-driver-causing-segmentation-fault-core-dumped-on-all-skylake-platforms/
> 
> Bug #800574 suggests disabling lock elision in glibc, which is already
> incorporated in experimental. This does not alleviate the issue. See the
> "steps to reproduce" below. The same bug suggests that the nvidia driver
> still has problems. I also ran the intel-microcode update, but that
> doesn't solve anything.

> Steps to reproduce: gdb vlc
> output:
> (gdb) run
> Starting program: /usr/bin/vlc
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> VLC media player 2.2.1 Terry Pratchett (Weatherwax) (revision 2.2.1-0-ga425c42)
> 
> Program received signal SIGSEGV, Segmentation fault.
> __lll_unlock_elision (lock=0x726d0d08, private=0)
> at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> 29  ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
> (gdb) bt
> #0  __lll_unlock_elision (lock=0x726d0d08, private=0)
> at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> #1  0x7247f26c in ?? () from /usr/lib/x86_64-linux-gnu/libEGL.so.1
> #2  0x7240fa22 in ?? () from /usr/lib/x86_64-linux-gnu/libEGL.so.1
> #3  0x7fffd960 in ?? ()
> #4  0x72493ea1 in ?? () from /usr/lib/x86_64-linux-gnu/libEGL.so.1
> #5  0x7fffd960 in ?? ()
> #6  0x77def59e in _dl_close_worker (map=,
> force=)
> at dl-close.c:291
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
> 
> /usr/lib/x86_64-linux-gnu/libEGL.so.1 -> /usr/lib/x86_64-linux-gnu/nvidia/libEGL.so.1
> 
> "dmesg|grep pthread" result:
> breetai@mainbak:~$ dmesg |grep pthread
> [73330.105569] traps: vlc[16815] general protection ip:7f47ac388950
> sp:7ffe3908ad98 error:0 in libpthread-2.22.so[7f47ac376000+18000]
> [78860.282876] traps: dolphin[18294] general protection ip:7fc3b0c1b950
> sp:7ffd0a0828d8 error:0 in libpthread-2.22.so[7fc3b0c09000+18000]
> [90812.515421] traps: krunner[20723] general protection ip:7f930fa19950
> sp:7ffc9b5cd988 error:0 in libpthread-2.22.so[7f930fa07000+18000]
> [90826.164341] traps: akonadi_migrati[21161] general protection ip:7f33b7e39950
> sp:7fff9d61bef8 error:0 in libpthread-2.22.so[7f33b7e27000+18000]
> [92621.782318] traps: vlc[21962] general protection ip:7f4241467950
> sp:7ffd8fa98f68 error:0 in libpthread-2.22.so[7f4241455000+18000]
> breetai@mainbak:~$
> 
> 
> installed packages:
> System runs testing.
> 
> libc6:amd64      2.22-0experimental0  from experimental
> nvidia-driver    352.63-1             from experimental
> intel-microcode  3.20151106.1         from testing
> vlc              2.2.1-5+b1           from testing