Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)
On 2015-12-08 22:30, Jelle Haandrikman wrote:
> Hi Andreas,
>
> On 2015-12-08 19:25, Andreas Beckmann wrote:
> >Hi Aurelien,
> >
> >... buggy software (#807244), which is only exposed by running on
> >hardware with working TSX-NI.
> >That could also explain the fact that the bug was introduced in 352+.
> >
> >Jelle, I didn't dig through the nvidia forums, but if this info isn't
> >mentioned there already, maybe you could post it:
> >
> >>According to the backtrace the problem is typical of a call to
> >>mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
> >>before.
> >(or was already unlocked.)
> I'm not a member of any Nvidia forum. I'm more of an advanced
> Debian user, with a technical background as a tester. All the searches
> that I just did regarding mutex_unlock() and the driver point back to
> this post.
>
> You really are doing the best analysis I have found. Unfortunately it's
> also the only one I can find.

As is often the case, this can also be found on the Arch Linux bug
tracking system:

https://bugs.archlinux.org/task/46064?project=1

There is even a link to an ugly patch showing that the issue has been
understood. Finally, according to the last post in that bug entry, it
seems that nvidia is about to release a fix.

Aurelien

--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)
Hi,

On 2015-12-07 23:26, Andreas Beckmann wrote:
> Dear libc maintainers,
>
> we recently got a bug report regarding the TSX-NI / lock elision bug in
> combination with the non-free nvidia driver (#807244). Since that is
> supposed to be fixed with the libc in experimental (and now sid as
> well), perhaps you could take a look why this still happens.
> Several forum posts denote that "compiling glibc without
> --enable-lock-elision" works around that issue.

I disagree that it is supposed to be fixed. Intel had a few bugs in
their TSX-NI implementation for Haswell and Broadwell and possibly early
versions of Skylake, and to avoid data loss we have therefore disabled
lock elision for some CPU revisions. That said, the bugs in the Intel
implementation are corner cases, and it took quite some time for them to
be discovered. If your program crashes reproducibly, it's definitely not
an issue with the TSX-NI implementation. Disabling --enable-lock-elision
is just a workaround for the real issue. People are now starting to have
CPUs with a working TSX-NI implementation, which are therefore not
blacklisted, and thus the problem is appearing again.

> A few ideas from my side, but since I don't have the hardware to test, I
> cannot check anything:
> * that specific CPU needs to be blacklisted / is incorrectly whitelisted

As said above, that cannot be the cause.

> * nvidia utilizes a code path in libc that is not covered by the current
>   patch (and that code path is not used by any other application)
> * nvidia does call something it shouldn't call directly ... thus
>   circumventing the runtime-disabling of the specific routines in libc6

According to the backtrace the problem is typical of a call to
mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
before. Nvidia should fix the bug there.

> * nvidia code does issue the problematic instructions itself (but the
>   backtrace points to libc, so this sounds unlikely)
>
> Is there some way to check at runtime how lock elision is handled by
> libc (on a concrete system)?

What do you mean by "how is it handled"?

I have attached a small program which demonstrates the issue. You can
use it to check if your system is using lock elision or not. Running
this program with ltrace, it's quite easy to spot the call to unlock an
already unlocked mutex. I wonder if it's doable to do the same with the
whole Nvidia stack.

Aurelien

--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net

/* compile with gcc -o mutex_crash_tsx mutex_crash_tsx.c -lpthread */
#include <pthread.h>

int main()
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    pthread_mutex_lock(&m);
    pthread_mutex_unlock(&m);
    /* second unlock of an already unlocked mutex: the bug being shown */
    pthread_mutex_unlock(&m);
}
Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)
Hi Aurelien,

thanks for your analysis.

On 2015-12-08 10:23, Aurelien Jarno wrote:
> I disagree it is supposed to be fixed. Intel got a few bugs in their
> TSX-NI implementation for Haswell and Broadwell and possibly early
> versions of Skylake, and to avoid data loss we have therefore disabled
> lock elision for some CPU revisions.

That's what I meant by "fixed". But obviously there are two problems
here: buggy hardware (blacklisted, #800574) and ...

> That said the bugs in the Intel
> implementation are corner cases, and it took quite some time for them to
> get discovered. If your program crashes reproducibly, it's definitely not
> an issue with the TSX-NI implementation. Disabling --enable-lock-elision
> is just a workaround for the real issue. People now start to have CPUs
> with a working TSX-NI implementation which is therefore not blacklisted
> and thus the problem is appearing again.

... buggy software (#807244), which is only exposed by running on
hardware with working TSX-NI.
That could also explain the fact that the bug was introduced in 352+.

Jelle, I didn't dig through the nvidia forums, but if this info isn't
mentioned there already, maybe you could post it:

> According to the backtrace the problem is typical of a call to
> mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
> before.

(or was already unlocked.)

Andreas
Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)
Hi Andreas,

On 2015-12-08 19:25, Andreas Beckmann wrote:
>Hi Aurelien,
>
>... buggy software (#807244), which is only exposed by running on
>hardware with working TSX-NI.
>That could also explain the fact that the bug was introduced in 352+.
>
>Jelle, I didn't dig through the nvidia forums, but if this info isn't
>mentioned there already, maybe you could post it:
>
>>According to the backtrace the problem is typical of a call to
>>mutex_unlock() on a mutex which hasn't been locked with mutex_lock()
>>before.
>(or was already unlocked.)

I'm not a member of any Nvidia forum. I'm more of an advanced Debian
user, with a technical background as a tester. All the searches that I
just did regarding mutex_unlock() and the driver point back to this
post.

You really are doing the best analysis I have found. Unfortunately it's
also the only one I can find.

Thanks for already doing this investigation.

best regards,
Jelle
Bug#800574: Bug#807244: libegl1-nvidia: Programs crash due to elisian-unlock on skylake processor with nvidia driver 352.63-1 (experimental)
Dear libc maintainers,

we recently got a bug report regarding the TSX-NI / lock elision bug in
combination with the non-free nvidia driver (#807244). Since that is
supposed to be fixed with the libc in experimental (and now sid as
well), perhaps you could take a look why this still happens.
Several forum posts denote that "compiling glibc without
--enable-lock-elision" works around that issue.

A few ideas from my side, but since I don't have the hardware to test, I
cannot check anything:
* that specific CPU needs to be blacklisted / is incorrectly whitelisted
* nvidia utilizes a code path in libc that is not covered by the current
  patch (and that code path is not used by any other application)
* nvidia does call something it shouldn't call directly ... thus
  circumventing the runtime-disabling of the specific routines in libc6
* nvidia code does issue the problematic instructions itself (but the
  backtrace points to libc, so this sounds unlikely)

Is there some way to check at runtime how lock elision is handled by
libc (on a concrete system)?

Andreas

On 2015-12-06 17:53, Jelle Haandrikman wrote:
> On a system with an Nvidia GTX 970 and an Intel Skylake i5-6600k running
> driver 352.63-1 (experimental), several programs crash due to TSX-NI /
> elision unlock. This affects sddm, unlocking kscreen, vlc and deleting
> files using dolphin.
>
> Other people also have found this issue.
> http://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/nvidia-linux/825702-nvidia-s-latest-binary-driver-is-causing-problems-for-some-skylake-linux-users
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574 #800574
> https://devtalk.nvidia.com/default/topic/893325/newest-and-beta-linux-driver-causing-segmentation-fault-core-dumped-on-all-skylake-platforms/
>
> Bug #800574 suggests disabling elision unlock in glibc, which is already
> incorporated in experimental. This does not alleviate the issue. See the
> "steps to reproduce" below. The same bug suggests that the nvidia driver
> still has problems. I also ran the intel-microcode update, but that
> doesn't solve anything.
>
> Steps to reproduce: gdb vlc
> output:
> (gdb) run
> Starting program: /usr/bin/vlc
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> VLC media player 2.2.1 Terry Pratchett (Weatherwax) (revision 2.2.1-0-ga425c42)
>
> Program received signal SIGSEGV, Segmentation fault.
> __lll_unlock_elision (lock=0x726d0d08, private=0)
>     at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> 29 ../sysdeps/unix/sysv/linux/x86/elision-unlock.c: No such file or directory.
> (gdb) bt
> #0 __lll_unlock_elision (lock=0x726d0d08, private=0)
>     at ../sysdeps/unix/sysv/linux/x86/elision-unlock.c:29
> #1 0x7247f26c in ?? () from /usr/lib/x86_64-linux-gnu/libEGL.so.1
> #2 0x7240fa22 in ?? () from /usr/lib/x86_64-linux-gnu/libEGL.so.1
> #3 0x7fffd960 in ?? ()
> #4 0x72493ea1 in ?? () from /usr/lib/x86_64-linux-gnu/libEGL.so.1
> #5 0x7fffd960 in ?? ()
> #6 0x77def59e in _dl_close_worker (map=, force=)
>     at dl-close.c:291
> Backtrace stopped: previous frame inner to this frame (corrupt stack?)
>
> /usr/lib/x86_64-linux-gnu/libEGL.so.1 -> /usr/lib/x86_64-linux-gnu/nvidia/libEGL.so.1
>
> "dmesg|grep pthread" result:
> breetai@mainbak:~$ dmesg |grep pthread
> [73330.105569] traps: vlc[16815] general protection ip:7f47ac388950 sp:7ffe3908ad98 error:0 in libpthread-2.22.so[7f47ac376000+18000]
> [78860.282876] traps: dolphin[18294] general protection ip:7fc3b0c1b950 sp:7ffd0a0828d8 error:0 in libpthread-2.22.so[7fc3b0c09000+18000]
> [90812.515421] traps: krunner[20723] general protection ip:7f930fa19950 sp:7ffc9b5cd988 error:0 in libpthread-2.22.so[7f930fa07000+18000]
> [90826.164341] traps: akonadi_migrati[21161] general protection ip:7f33b7e39950 sp:7fff9d61bef8 error:0 in libpthread-2.22.so[7f33b7e27000+18000]
> [92621.782318] traps: vlc[21962] general protection ip:7f4241467950 sp:7ffd8fa98f68 error:0 in libpthread-2.22.so[7f4241455000+18000]
> breetai@mainbak:~$
>
> installed packages:
> System runs testing.
>
> libc6:amd64      2.22-0experimental0  from experimental
> nvidia-driver    352.63-1             from experimental
> intel-microcode  3.20151106.1         from testing
> vlc              2.2.1-5+b1           from testing