Bug#1019855: Fwd: libc6: immediately crashes with SIGILL on 4th gen Intel Core CPUs (seems related to AVX2 instructions), bricking the whole system

Aurelien Jarno Thu, 15 Sep 2022 12:51:39 -0700

Hi,

On 2022-09-15 20:59, debian-bug-rep...@p0358.net wrote:
> > The first thing would be to provide the output of /proc/cpuinfo
> 
> Pasting below (please **NOTE** that "avx2" would normally be there, but is
> currently missing due to this kernel option `clearcpuid=293` with which I
> booted the PC now -- I can **100%** confirm "avx2" was there before, but
> don't want to reboot for now to remove this kernel flag):
> 
> # cat /proc/cpuinfo
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 60
> model name      : Intel(R) Core(TM) i3-4000M CPU @ 2.40GHz
> stepping        : 3
> microcode       : 0x12
> cpu MHz         : 2394.664
> cache size      : 3072 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 2
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2
> ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 movbe popcnt xsave avx f16c
> rdrand lahf_lm abm cpuid_fault epb invpcid_single pti tpr_shadow vnmi
> flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms invpcid xsaveopt
> dtherm arat pln pts
> vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb
> flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
> bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
> mds swapgs itlb_multihit srbds
> bogomips        : 4789.10
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 39 bits physical, 48 bits virtual
> power management:


Thanks.
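
For reference, the number in `clearcpuid=293` can be decoded by hand. Linux numbers its CPU feature flags as word*32+bit, and 293 lands in word 9, bit 5, which is where the kernel keeps AVX2 (CPUID leaf 7, EBX bit 5). The following Python sketch shows that arithmetic and a check for a flag in /proc/cpuinfo text. The helper names `decode_feature_number` and `has_flag` are hypothetical and only serve the illustration.

```python
def decode_feature_number(n):
    """Split a Linux cpufeature number into (word, bit); each word holds 32 flags."""
    return divmod(n, 32)


def has_flag(cpuinfo_text, flag):
    """Return True if `flag` is listed on the first "flags" line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the key exactly so that "vmx flags" is not mistaken for "flags".
        if sep and key.strip() == "flags":
            return flag in value.split()
    return False


if __name__ == "__main__":
    word, bit = decode_feature_number(293)
    print(f"clearcpuid=293 -> word {word}, bit {bit}")
    try:
        with open("/proc/cpuinfo") as f:
            print("avx2 listed:", has_flag(f.read(), "avx2"))
    except OSError:
        print("/proc/cpuinfo not available on this system")
```

As noted in the quoted text, this only reflects the kernel's view of the CPU: a flag missing from /proc/cpuinfo does not change what the cpuid instruction reports to userspace.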

> > If you believe the issue is due to AVX2, clearcpuid won't help, as it
> > just clears the corresponding flags from the kernel's point of view, but
> > the cpuid instruction will continue to behave the same. The way to
> > disable that feature at the glibc level is to set the GLIBC_TUNABLES
> > environment variable to "glibc.cpu.hwcaps=-AVX2_Usable".
> 
> This works! Indeed, the clearcpuid flag on its own did nothing, as you
> mentioned. This workaround is great to know for the time being.

Great, that's narrowing down the problem.
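
To illustrate the workaround, here is a minimal Python sketch that applies the tunable to a single command instead of the whole session. The helper name `env_without_avx2` is hypothetical, and the tunable string is simply the one quoted above; the feature name may differ between glibc versions. On an affected machine the Python interpreter itself would also have to be started with the workaround in place.

```python
import os
import subprocess
import sys

# Tunable string as quoted in this thread.
TUNABLE = "glibc.cpu.hwcaps=-AVX2_Usable"


def env_without_avx2(base_env=None):
    """Return a copy of the environment with the AVX2 workaround added."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("GLIBC_TUNABLES")
    # GLIBC_TUNABLES is a colon-separated list, so keep any tunables already set.
    env["GLIBC_TUNABLES"] = f"{existing}:{TUNABLE}" if existing else TUNABLE
    return env


if __name__ == "__main__":
    # Placeholder command: a child Python that does nothing.
    result = subprocess.run([sys.executable, "-c", "pass"], env=env_without_avx2())
    print("exit status:", result.returncode)
```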

> > Same from there due to ASLR. It seems to fail in at least two different
> > locations. Do you have some extra lines around these? Sometimes the kernel
> > dumps the addresses around the instruction pointer.
> 
> Generally these lines all followed similar pattern, and there was nothing
> printed below or after, just this single line per crash. I will paste a few
> more below. Isn't the "+15a000" the relative offset in libc .so though? It

The +15a000 is the size of the libc.so.6 mapping in the virtual memory.

> does seem like an oddly round number, but I loaded the library in IDA
> disassembler and the instructions at this offset do seem to be related to
> AVX2 (linking screenshot which I also pasted on the linked GitHub issue)
> (the highlighted instruction in gray seems to be the one at this
> aforementioned offset):
> https://user-images.githubusercontent.com/5182588/190256853-29ae80aa-0089-4da2-a430-990e2693d15c.png
> 
> If my above hypothesis is correct, then I looked at the mother function in
> x-refs and it does seem to be defined in rtld_global_ro table, and its name
> is "__strncmp_avx2". Was something changed in this function between the
> updates?
> 
> Pasting more kernel lines:
> kernel: [852124.361775] traps: dhclient[1583381] trap invalid opcode
> ip:7fe19118051d sp:7ffee6e36238 error:0 in libc-2.31.so[7fe191044000+15a000]
> kernel: [852124.468314] traps: nft[1583398] trap invalid opcode
> ip:7fe3418fe51d sp:7fff11342df8 error:0 in libc-2.31.so[7fe3417c2000+15a000]
> kernel: [852124.572700] traps: systemd-shutdow[1377424] trap invalid opcode
> ip:7fde88b724ad sp:7ffc13767028 error:0 in libc-2.31.so[7fde88a3a000+15a000]
> kernel: [  270.477024] traps: bun[2055] trap invalid opcode ip:2e363f4
> sp:7ffe2320d640 error:0 in bun[2a6f000+2ce2000]
> kernel: [  279.884807] traps: systemd[2115] trap invalid opcode
> ip:7faf645ec4ad sp:7ffe12e06c48 error:0 in libc-2.31.so[7faf644b4000+15a000]
> kernel: [  299.637575] traps: bun[2296] trap invalid opcode ip:2e363f4
> sp:7ffd0c0bc9c0 error:0 in bun[2a6f000+2ce2000]
> kernel: [  331.036417] traps: bash[2462] trap invalid opcode ip:7ff42840051d
> sp:7ffd34ad7278 error:0 in libc-2.31.so[7ff4282c4000+15a000]
> kernel: [  357.184428] traps: bash[2652] trap invalid opcode ip:7f717873751d
> sp:7fffd34c8848 error:0 in libc-2.31.so[7f71785fb000+15a000]
> kernel: [  645.517556] traps: bash[3508] trap invalid opcode ip:7f4b6ee8851d
> sp:7ffd74beb6e8 error:0 in libc-2.31.so[7f4b6ed4c000+15a000]
> kernel: [  876.760209] traps: bash[4225] trap invalid opcode ip:7fd30515a0c4
> sp:7ffc604bb118 error:0 in libc.so.6[7fd30502a000+154000]
> kernel: [  891.000593] traps: bash[4399] trap invalid opcode ip:7f3c25acd0c4
> sp:7fff33adcab8 error:0 in libc.so.6[7f3c2599d000+154000]
> kernel: [ 1245.008705] traps: systemd[5382] trap invalid opcode
> ip:7fe82964f4ad sp:7ffd9967ace8 error:0 in libc-2.31.so[7fe829517000+15a000]
> kernel: [ 1472.084646] traps: systemd[6104] trap invalid opcode
> ip:7fd0316cb4ad sp:7fff24a010b8 error:0 in libc-2.31.so[7fd031593000+15a000]
> kernel: [ 1487.513379] traps: systemd[6269] trap invalid opcode
> ip:7fa89d8354ad sp:7fffdc2b9328 error:0 in libc-2.31.so[7fa89d6fd000+15a000]
> kernel: [ 1541.866530] traps: systemd[7005] trap invalid opcode
> ip:7fbb764d74ad sp:7ffd302b3718 error:0 in libc-2.31.so[7fbb7639f000+15a000]
> kernel: [ 1774.377940] traps: systemd[7750] trap invalid opcode
> ip:7f5db1ae54ad sp:7ffc9ba5ef58 error:0 in libc-2.31.so[7f5db19ad000+15a000]
> kernel: [66259.261517] traps: bash[428087] trap invalid opcode
> ip:7fc5f7364422 sp:7fff81b7f918 error:0 in libc.so.6[7fc5f7234000+16e000]
> kernel: [67322.502174] traps: bash[435709] trap invalid opcode
> ip:7f24abbad422 sp:7ffe428004f8 error:0 in libc.so.6[7f24aba7d000+16e000]
> kernel: [67339.606742] traps: passwd[436152] trap invalid opcode
> ip:7f4f047ce422 sp:7fff59b0f618 error:0 in libc.so.6[7f4f0469e000+16e000]
> kernel: [67433.720656] traps: adduser[437285] trap invalid opcode
> ip:7f0e09f602b7 sp:7fff697e8f98 error:0 in libc-2.31.so[7f0e09e28000+15a000]
> kernel: [67714.117441] traps: bash[439504] trap invalid opcode
> ip:7f96d3a5c0c4 sp:7ffd554b71a8 error:0 in libc.so.6[7f96d392c000+154000]
> 
> Note that this time around they come from different libc compilations:
> - +15a000 one is from Debian Stable (debian:bullseye-20220912-slim docker
> image)
> - +16e000 one is from Debian Sid (debian:sid-slim docker image)
> - +2ce2000 is bun.js, an unrelated program that seems to have libc6 statically
> linked
> - +154000 is from Fedora for good measure (fedora:rawhide docker image)

As said above, this is basically linked to the size of the libc.so.6
file, or more precisely the part that is mapped into memory. That said,
it seems the crash is happening in multiple places, judging by the last
digits of the ip address (knowing that the exact address is randomized
by ASLR).
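
Since each trap line carries both the instruction pointer and the start of the mapping, subtracting the two gives a position that stays the same across ASLR runs. The Python sketch below does this for trap lines in the format quoted above, after rejoining each wrapped line. The helper name `trap_offset` is hypothetical. The result is relative to the start of the mapping that contains the ip, which is not necessarily the start of the libc.so.6 file, so matching it to a symbol may require adding that mapping's file offset.

```python
import re

# Fields of a kernel "traps:" line in the format quoted in this thread.
TRAP_RE = re.compile(
    r"traps: (?P<comm>\S+)\[(?P<pid>\d+)\] trap invalid opcode\s+"
    r"ip:(?P<ip>[0-9a-f]+)\s+sp:[0-9a-f]+\s+error:\d+\s+"
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def trap_offset(line):
    """Return (object name, mapping size, ip - mapping start), or None if no match."""
    m = TRAP_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return m["obj"], int(m["size"], 16), ip - base


if __name__ == "__main__":
    samples = [
        "kernel: [852124.361775] traps: dhclient[1583381] trap invalid opcode "
        "ip:7fe19118051d sp:7ffee6e36238 error:0 in libc-2.31.so[7fe191044000+15a000]",
        "kernel: [852124.572700] traps: systemd-shutdow[1377424] trap invalid opcode "
        "ip:7fde88b724ad sp:7ffc13767028 error:0 in libc-2.31.so[7fde88a3a000+15a000]",
    ]
    for line in samples:
        obj, size, offset = trap_offset(line)
        print(f"{obj}: mapping size {size:#x}, ip offset {offset:#x}")
```

For the bullseye lines quoted above this gives 0x13c51d, 0x1384ad and 0x1382b7, which is consistent with the crash happening in more than one place.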

> So this "+" number being the same for each distinct libc build does
> strongly suggest that it is simply the relative instruction offset in the
> .so itself.
> 
> I might be wrong though, I also had no idea where to get debug symbols from,
> and gdb didn't seem to be willing to print any useful information either...
> Do you think I should setup another LXC container and upgrade the libc6
> using this env var workaround and then try running some program under gdb
> itself with this variable cleared? I've never actually used gdb debugger,
> but surely a simple debugging up to a crash couldn't be that hard...

The debug symbols are available in the libc6-dbg package. Basically, you
can first get a shell with the glibc.cpu.hwcaps workaround. Then run
ulimit -c unlimited to enable core files, and execute a binary that
fails, this time without the glibc.cpu.hwcaps workaround. You can then
examine the core using gdb binary corefile.
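
The same steps can be sketched in Python, assuming the interpreter itself was started under the workaround. The helper name `crashes_with_sigill` is hypothetical. It raises the soft core-size limit to the hard limit in the child (the closest unprivileged equivalent of ulimit -c unlimited), drops GLIBC_TUNABLES for the child only, and reports whether the child died from SIGILL. Whether and where a core file is actually written still depends on the system's core dump configuration.

```python
import os
import resource
import signal
import subprocess
import sys


def crashes_with_sigill(argv):
    """Run argv without GLIBC_TUNABLES and return True if it was killed by SIGILL."""
    env = dict(os.environ)
    # Remove the workaround for the child only (this drops any other tunables too).
    env.pop("GLIBC_TUNABLES", None)

    def raise_core_limit():
        # Runs in the child just before exec: allow core files up to the hard limit.
        _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    proc = subprocess.run(argv, env=env, preexec_fn=raise_core_limit)
    # subprocess reports death by signal N as a return code of -N.
    return proc.returncode == -signal.SIGILL


if __name__ == "__main__" and len(sys.argv) > 1:
    print("killed by SIGILL:", crashes_with_sigill(sys.argv[1:]))
```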

> > The changes that are in this stable release have been (or at least were
> > supposed to, given the bug you reported) in testing/sid for a few
> > months. Are you able to do a test with debian sid, for instance in
> > docker?
> 
> Yes, same story apparently. Btw, I tested in a similar way in the latest
> Fedora, with the exact same outcome, so in the end the issue seems not to be
> isolated to Debian, but tied to libc6 and this particular set of patches that
> eventually found its way into Debian Stable.
> 
> # docker run -it --rm debian:sid-slim bash
> # echo $?
> 132
> 
> ^ Interestingly, apt is one program that does work on sid, while not working
> on stable.

Ok, thanks. It's interesting it also fails in sid and on Fedora. The
change has been introduced back in February, so it's strange it has not
been noticed yet.
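
The exit status 132 in the quoted docker run matches the trap lines: shells conventionally report a process killed by signal N as 128+N, and signal 4 is SIGILL on Linux. A minimal Python sketch of that decoding, with a hypothetical helper name `signal_from_exit_status`:

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name for a shell-style 128+N exit status, else None."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # Not a signal number known on this platform.
        return None


if __name__ == "__main__":
    print(signal_from_exit_status(132))
```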

> Looking at this changelog...:
> https://tracker.debian.org/news/1358014/accepted-glibc-231-13deb11u4-source-into-proposed-updates-stable-new-proposed-updates/
> 
> ... is there perhaps some way these changes could be applied one by one to
> pinpoint which one is causing issues that way?

Unfortunately, not that easily unless you want to compile the upstream
sources. As you pointed out, the problem is very likely related to the
AVX2 changes, so having the address of the illegal instruction would
help a lot to understand the issue.

> This machine, in case it matters, is a Lenovo G510 laptop. There is some
> update available for the BIOS, but it requires booting up Windows to perform
> it. Should I attempt that? I found some ancient thread on some forum that
> mentioned BIOS update fixes some issues with "freezes" on

As said above, I find it strange that the problem has not been noticed
yet, given that it affects at least two distributions and that the change
has been in sid for a few months. You might want to install the
intel-microcode package and reboot to see if it helps; it should have the
same effect as updating the BIOS from the point of view of the current bug.

Regards
Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurel...@aurel32.net                 http://www.aurel32.net
