Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
Hi!

I have this issue with cron-apt:

8<
/etc/cron.daily/apt-compat: Illegal instruction
R: /etc/cron.daily/apt-compat: 44: arithmetic expression: expecting primary: " % 32767 "
run-parts: /etc/cron.daily/apt-compat exited with return code 2
>8

A change in the config file of the domU made the problem go away (not really a solution to the problem, but maybe a useful hint):

old domU.cfg:
kernel = '/usr/lib/grub-xen/grub-x86_64-xen.bin'
extra = 'elevator=noop'

new domU.cfg:
type = 'pvh'
kernel = '/usr/lib/grub-xen/grub-i386-xen_pvh.bin'
extra = 'elevator=noop'

I also get different output from `cpuid -1` with these two configs.

Markus
Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
Hi,

Pádraig Brady wrote:
> At this stage it would be good to get the output from `cpuid -1`

Ok, I've attached the output of "cpuid -1" from both affected DomUs (the outputs differ slightly) as well as from the unaffected hosting server (same CPU) for comparison.

cpuid-domu1.txt and cpuid-domu2.txt are the outputs from the two affected DomUs (VMs), and cpuid-dom0.txt is the output from the (Debian 11) Xen hosting server.

One more note: The Xen version running on the hosting server is 4.14.5+94-ge49571868d-1 (the one from Debian 11), in case that's of interest.

HTH!

Regards, Axel
-- 
 ,''`.  |  Axel Beckert, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329 6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486 202E C09E 1D89 9593 0EDE

cpuid-domu1.txt.gz Description: application/gzip
cpuid-domu2.txt.gz Description: application/gzip
cpuid-dom0.txt.gz Description: application/gzip
signature.asc Description: PGP signature
Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
On 13/06/2023 09:38, Axel Beckert wrote:
> Hi, especially to Pádraig,
>
> Pádraig Brady wrote:
>> cksum since v9.0 checks at runtime whether pclmul is supported.
>> It seems that check is not working appropriately on a Xen DomU.
>> The routine in question is pclmul_supported() at:
>> https://github.com/coreutils/coreutils/blob/b841f111/src/cksum.c#L160-L191
>>
>> That either suggests xen is incorrectly setting PCLMUL and AVX bits,
>> or perhaps these two bits are not sufficient.
>> Hmm I wonder do we also need to explicitly check for SSSE3 support?
>>
>> I.e. I wonder does cksum built with the following help?
> […]
>> diff --git a/src/cksum.c b/src/cksum.c
>> index 85afab0ac..98733dadf 100644
>> --- a/src/cksum.c
>> +++ b/src/cksum.c
>> @@ -172,7 +172,7 @@ pclmul_supported (void)
>>        return false;
>>      }
>>
>> -  if (! (ecx & bit_PCLMUL) || ! (ecx & bit_AVX))
>> +  if (! (ecx & bit_PCLMUL) || ! (ecx & bit_AVX) || ! (ecx & bit_SSSE3))
>>      {
>>        if (cksum_debug)
>>          error (0, 0, "%s", _("pclmul support not detected"));
>
> No, the patch unfortunately didn't help:

It's great you can test changes at least.
Thanks for trying the above.

At this stage it would be good to get the output from `cpuid -1`
so that hopefully we can get something there to key on
that indicates the cpu doesn't support the instructions.

thanks,
Pádraig
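For anyone following along without the coreutils source at hand: the test in the quoted diff amounts to reading CPUID leaf 1 and masking bits in ECX. Below is a minimal sketch of that idea, assuming GCC or Clang on x86 with their <cpuid.h> header. The helper names (ecx_has_all, pclmul_flags_present) and the SKETCH_* macros are made up for illustration; this is not the coreutils implementation. It can only report what CPUID claims, which is exactly what is in question on the affected DomUs.

```c
#include <stdbool.h>
#if defined __x86_64__ || defined __i386__
# include <cpuid.h>   /* __get_cpuid(), provided by GCC and Clang */
#endif

/* Bit positions in CPUID leaf 1, register ECX.  The names are
   hypothetical, chosen so they do not clash with <cpuid.h>.  */
#define SKETCH_ECX_PCLMULQDQ (1u << 1)
#define SKETCH_ECX_SSSE3     (1u << 9)
#define SKETCH_ECX_AVX       (1u << 28)

/* Hypothetical helper: true if every bit of REQUIRED is set in ECX.  */
bool
ecx_has_all (unsigned int ecx, unsigned int required)
{
  return (ecx & required) == required;
}

/* Hypothetical stand-in for the flag test in the quoted diff:
   true if CPUID reports PCLMULQDQ, AVX and SSSE3.  */
bool
pclmul_flags_present (void)
{
#if defined __x86_64__ || defined __i386__
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

  /* __get_cpuid returns 0 if the requested leaf is not available.  */
  if (! __get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return false;

  return ecx_has_all (ecx, SKETCH_ECX_PCLMULQDQ
                           | SKETCH_ECX_AVX
                           | SKETCH_ECX_SSSE3);
#else
  return false;   /* not an x86 build, so no CPUID to query */
#endif
}
```

Keeping the mask test in its own function makes it easy to feed in an ECX value taken from a `cpuid -1` dump and see which way the check would go.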
Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
Hi, especially to Pádraig,

I wrote:
> Control: affects -1 aptitude-robot

JFYI: The fix for /etc/cron.daily/aptitude-robot, which triggers the issue in its non-bash compatibility mode, is to change its shebang line from "#!/bin/sh" to "#!/bin/bash". (Using dpkg-reconfigure dash or bash to switch /bin/sh to bash unfortunately no longer works since Debian 12. Cc'ing andrewsh@d.o for that comment.)

Pádraig Brady wrote:
> cksum since v9.0 checks at runtime whether pclmul is supported.
> It seems that check is not working appropriately on a Xen DomU.
> The routine in question is pclmul_supported() at:
> https://github.com/coreutils/coreutils/blob/b841f111/src/cksum.c#L160-L191
>
> That either suggests xen is incorrectly setting PCLMUL and AVX bits,
> or perhaps these two bits are not sufficient.
> Hmm I wonder do we also need to explicitly check for SSSE3 support?
>
> I.e. I wonder does cksum built with the following help?
[…]
> diff --git a/src/cksum.c b/src/cksum.c
> index 85afab0ac..98733dadf 100644
> --- a/src/cksum.c
> +++ b/src/cksum.c
> @@ -172,7 +172,7 @@ pclmul_supported (void)
>        return false;
>      }
>
> -  if (! (ecx & bit_PCLMUL) || ! (ecx & bit_AVX))
> +  if (! (ecx & bit_PCLMUL) || ! (ecx & bit_AVX) || !
(ecx & bit_SSSE3)) > { >if (cksum_debug) > error (0, 0, "%s", _("pclmul support not detected")); No, the patch unfortunately didn't help: # dpkg -l coreutils Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) ||/ Name VersionArchitecture Description +++-==-==--= ii coreutils 9.1-1+abetest1 amd64GNU core utilities # while :; do dd if=/dev/urandom count=1 2> /dev/null | cksum ; done Illegal instruction 3835785655 512 1264218280 512 1265063674 512 3358845510 512 3390842004 512 658376191 512 3092360732 512 57993113 512 4257983404 512 2816803635 512 4082554882 512 1183251249 512 3097645355 512 3238771197 512 229543 512 3714227940 512 3331192910 512 1805379772 512 2540013463 512 294869588 512 222826476 512 1622837079 512 2515049677 512 3855944559 512 4031692020 512 4041321365 512 1802184575 512 2031964685 512 2781701490 512 460914961 512 Illegal instruction 3835252621 512 412678137 512 200496131 512 194185340 512 3286885624 512 Illegal instruction 2202092457 512 418097046 512 2216824095 512 3861063118 512 4214986749 512 259193791 512 2169514763 512 892443556 512 705097717 512 1758684834 512 2206099568 512 1780257589 512 82224867 512 Illegal instruction 2247709549 512 […] Thanks for trying to find a patch anyways. 
> # while :; do dd if=/dev/urandom count=1 2> /dev/null | cksum ; done > 1758277878 512 > 2101634611 512 > Illegal instruction > Illegal instruction > Illegal instruction > Illegal instruction > Illegal instruction > Illegal instruction > Illegal instruction > Illegal instruction > Illegal instruction > Illegal instruction > Illegal instruction > 2704754638 512 > Illegal instruction > 4028135672 512 > 2625667858 512 > Illegal instruction > Illegal instruction > Illegal instruction One weird thing: The "Illegal instruction" happens much more seldom today on the second affected DomU with the patched cksum (above) as well as with the unpatched cksum (below), not sure why. Maybe this also gives some hint on where to look for the cause of this issue. 1829747093 512 Illegal instruction 198577731 512 428043084 512 3695864207 512 2965121539 512 1048852751 512 3278958013 512 Illegal instruction 1852035202 512 2493300527 512 2163958493 512 1863124891 512 2734183826 512 1004299335 512 3257604044 512 1233477715 512 1720570219 512 3013835401 512 3175649825 512 1828643038 512 3146557230 512 911790943 512 1016865138 512 3033781151 512 Illegal instruction 2243248050 512 The DomU on which I initially discovered the issue still hits it as hard as before or maybe even harder now: […] Illegal instruction 3963278313 512 Illegal instruction 118145379 512 211261244 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction 1435849033 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal 
instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction
Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
Hi Kristoffer,

Kristoffer Brånemyr wrote:
> But I think it's a bit suspicious that it only crashes sometimes. If
> there was some instruction which causes this, should it not happen
> every time?

Good point.

> Can you reproduce the problem running cksum in gdb?

Yes:

# dd if=/dev/urandom count=1 2> /dev/null | gdb -ex run -ex bt -batch cksum
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction.
0x5556ccf5 in cksum_pclmul (fp=0x77faca80 <_IO_2_1_stdin_>, crc_out=0x7fffe8d0, length_out=0x7fffe8c8) at src/cksum_pclmul.c:59
59 src/cksum_pclmul.c: No such file or directory.
#0 0x5556ccf5 in cksum_pclmul (fp=0x77faca80 <_IO_2_1_stdin_>, crc_out=0x7fffe8d0, length_out=0x7fffe8c8) at src/cksum_pclmul.c:59
#1 0xabb0 in crc_sum_stream (stream=0x77faca80 <_IO_2_1_stdin_>, resstream=0x7fffe9f0, length=0x7fffe9e8) at src/cksum.c:269
#2 0x7eaa in digest_file (filename=filename@entry=0x5556d14f "-", bin_result=bin_result@entry=0x7fffe9f0 '/' , "\377", missing=missing@entry=0x7fffe9e0, length=length@entry=0x7fffe9e8, binary=) at src/digest.c:945
#3 0x71c7 in main (argc=1, argv=) at src/digest.c:1504

Does this help?

Regards, Axel
-- 
 ,''`.  |  Axel Beckert, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329 6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486 202E C09E 1D89 9593 0EDE
Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
I guess it doesn't hurt to also check for the SSE variants in the function that tests whether pclmul is supported.

But I think it's a bit suspicious that it only crashes sometimes. If there was some instruction which causes this, should it not happen every time? Could it be something else, like some unaligned address read/write that causes this? I guess ILL_ILLOPN might refer to the argument of an instruction (i.e. possibly an address?).

Can you reproduce the problem running cksum in gdb? Then you could disassemble the location it crashes in and possibly see a bit better what causes the issue. Also dump the values of the hardware registers, and of variables if you can.

-- 
/Kristoffer Brånemyr

On Monday, 12 June 2023 at 15:03:11 CEST, Philip Rowlands wrote:

On Sat, 10 Jun 2023, at 11:09, Pádraig Brady wrote:
> cksum since v9.0 checks at runtime whether pclmul is supported.
> It seems that check is not working appropriately on a Xen DomU.

Hypervisors routinely lie about CPUID feature flags, in order to maintain compatibility between a fleet of diverse servers. It's possible in this case that the system was misconfigured to present flags which the underlying CPU doesn't support.

> The routine in question is pclmul_supported() at:
> https://github.com/coreutils/coreutils/blob/b841f111/src/cksum.c#L160-L191
>
> That either suggests xen is incorrectly setting PCLMUL and AVX bits,
> or perhaps these two bits are not sufficient.
> Hmm I wonder do we also need to explicitly check for SSSE3 support?

Intel says to check for SSE and SSE2; quoting the manual

===
11.6.2 Checking for Intel® SSE and SSE2 Support

Before an application attempts to use Intel SSE and/or Intel SSE2, it should check that they are present on the processor:

1. Check that the processor supports the CPUID instruction. Bit 21 of the EFLAGS register can be used to check processor’s support the CPUID instruction.
2. Check that the processor supports Intel SSE and/or SSE2 (true if CPUID.01H:EDX.SSE[bit 25] = 1 and/or CPUID.01H:EDX.SSE2[bit 26] = 1).

12.13.4 Checking for Intel® AES-NI Support

Before an application attempts to use AESNI instructions or PCLMULQDQ, the application should follow the steps illustrated in Section 11.6.2, “Checking for Intel® SSE and SSE2 Support.” Next, use the additional step provided below:

Check that the processor supports Intel AES-NI (if CPUID.01H:ECX.AESNI[bit 25] = 1); check that the processor supports PCLMULQDQ (if CPUID.01H:ECX.PCLMULQDQ[bit 1] = 1).
===

Wikipedia mentions an AVX-512 version (VPCLMULQDQ) but I don't think we're using that. I can't find the equivalent AMD docs.

Is there a library / macro check for this, to avoid the low-level bit inspection?

It would be useful to see the output of "cpuid -1" which does a verbose decode of all CPUID flags, on the system which sees the SIGILL. (How can it be intermittent??)

Interesting that the strace output finishes with:

read(0, "", 61440) = 0
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x55bec9cc6cf5} ---
+++ killed by SIGILL +++

i.e. ILL_ILLOPN (operand) rather than ILL_ILLOPC (opcode). What could cause this?

Cheers,
Phil
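The si_code shown by strace can also be captured from inside a test program, which may be handy when strace or gdb is not available on an affected guest. The following is only a sketch, assuming a POSIX system: sigill_code_name and install_sigill_reporter are hypothetical names, not part of coreutils, and the handler merely prints the decoded si_code before exiting.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: map a SIGILL si_code to a readable name.  */
const char *
sigill_code_name (int si_code)
{
  switch (si_code)
    {
    case ILL_ILLOPC: return "ILL_ILLOPC (illegal opcode)";
    case ILL_ILLOPN: return "ILL_ILLOPN (illegal operand)";
    case ILL_ILLADR: return "ILL_ILLADR (illegal addressing mode)";
    case ILL_ILLTRP: return "ILL_ILLTRP (illegal trap)";
    case ILL_PRVOPC: return "ILL_PRVOPC (privileged opcode)";
    case ILL_PRVREG: return "ILL_PRVREG (privileged register)";
    case ILL_COPROC: return "ILL_COPROC (coprocessor error)";
    case ILL_BADSTK: return "ILL_BADSTK (internal stack error)";
    default:         return "unknown si_code";
    }
}

/* Signal handler: report the decoded si_code on stderr and exit.
   Only write() and _exit() are used for I/O and termination, since
   most of stdio is not safe to call from a signal handler.  */
static void
on_sigill (int signo, siginfo_t *info, void *context)
{
  (void) signo;
  (void) context;
  const char *name = sigill_code_name (info->si_code);
  if (write (STDERR_FILENO, name, strlen (name)) < 0
      || write (STDERR_FILENO, "\n", 1) < 0)
    _exit (2);
  _exit (1);
}

/* Hypothetical helper: install the reporter for SIGILL.
   Returns 0 on success and -1 on failure, like sigaction().  */
int
install_sigill_reporter (void)
{
  struct sigaction sa;
  memset (&sa, 0, sizeof sa);
  sa.sa_sigaction = on_sigill;
  sa.sa_flags = SA_SIGINFO;   /* request the three-argument handler */
  sigemptyset (&sa.sa_mask);
  return sigaction (SIGILL, &sa, NULL);
}
```

A test program would call install_sigill_reporter() once at startup and then run the code under suspicion. Note that a SIGILL sent with kill or raise carries a different si_code, so it would be reported as unknown here.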
Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
On Sat, 10 Jun 2023, at 11:09, Pádraig Brady wrote:
> cksum since v9.0 checks at runtime whether pclmul is supported.
> It seems that check is not working appropriately on a Xen DomU.

Hypervisors routinely lie about CPUID feature flags, in order to maintain compatibility between a fleet of diverse servers. It's possible in this case that the system was misconfigured to present flags which the underlying CPU doesn't support.

> The routine in question is pclmul_supported() at:
> https://github.com/coreutils/coreutils/blob/b841f111/src/cksum.c#L160-L191
>
> That either suggests xen is incorrectly setting PCLMUL and AVX bits,
> or perhaps these two bits are not sufficient.
> Hmm I wonder do we also need to explicitly check for SSSE3 support?

Intel says to check for SSE and SSE2; quoting the manual

===
11.6.2 Checking for Intel® SSE and SSE2 Support

Before an application attempts to use Intel SSE and/or Intel SSE2, it should check that they are present on the processor:

1. Check that the processor supports the CPUID instruction. Bit 21 of the EFLAGS register can be used to check processor’s support the CPUID instruction.
2. Check that the processor supports Intel SSE and/or SSE2 (true if CPUID.01H:EDX.SSE[bit 25] = 1 and/or CPUID.01H:EDX.SSE2[bit 26] = 1).

12.13.4 Checking for Intel® AES-NI Support

Before an application attempts to use AESNI instructions or PCLMULQDQ, the application should follow the steps illustrated in Section 11.6.2, “Checking for Intel® SSE and SSE2 Support.” Next, use the additional step provided below:

Check that the processor supports Intel AES-NI (if CPUID.01H:ECX.AESNI[bit 25] = 1); check that the processor supports PCLMULQDQ (if CPUID.01H:ECX.PCLMULQDQ[bit 1] = 1).
===

Wikipedia mentions an AVX-512 version (VPCLMULQDQ) but I don't think we're using that. I can't find the equivalent AMD docs.

Is there a library / macro check for this, to avoid the low-level bit inspection?

It would be useful to see the output of "cpuid -1" which does a verbose decode of all CPUID flags, on the system which sees the SIGILL. (How can it be intermittent??)

Interesting that the strace output finishes with:

read(0, "", 61440) = 0
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x55bec9cc6cf5} ---
+++ killed by SIGILL +++

i.e. ILL_ILLOPN (operand) rather than ILL_ILLOPC (opcode). What could cause this?

Cheers,
Phil
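On the question of a library or macro check: GCC and Clang provide the x86 builtins __builtin_cpu_init() and __builtin_cpu_supports(), which take a feature name as a string and hide the bit inspection. A minimal sketch follows, assuming one of those compilers; cpu_reports_pclmul_path is a hypothetical name, and whether this check would behave any differently on the affected DomUs is not something this thread establishes.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the compiler's runtime feature
   detection reports both PCLMUL and AVX.  x86 with GCC or Clang
   only; on other targets it simply reports false.  */
bool
cpu_reports_pclmul_path (void)
{
#if defined __x86_64__ || defined __i386__
  /* Initialise the feature data before querying it.  This is
     required when the query may run before constructors have run,
     and is harmless otherwise.  */
  __builtin_cpu_init ();

  return __builtin_cpu_supports ("pclmul")
         && __builtin_cpu_supports ("avx");
#else
  return false;
#endif
}
```

The feature names must be string literals that the compiler recognises; "pclmul", "avx" and "ssse3" are among those documented for GCC.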
Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
On 09/06/2023 18:40, Axel Beckert wrote:
> Package: coreutils
> Version: 9.1-1
> Severity: important
> X-Debbugs-Cc: a...@debian.org
> Control: affects -1 aptitude-robot
>
> On a Xen DomU running Debian 12, cksum intermittently crashes as follows:
>
> # while :; do dd if=/dev/urandom count=1 2> /dev/null | cksum ; done
> 1758277878 512
> 2101634611 512
> Illegal instruction
>
> So to summarise
>
> * Debian 12 in Xen DomU exhibits this behaviour.
> * Debian 11 in Xen DomU on same Dom0 does not exhibit this behaviour.
> * The Xen Dom0 (Debian 11 though) itself does not exhibit this behaviour.
> * A Debian 12 installation on bare metal with the same CPU ("AMD EPYC
>   7313P 16-Core Processor") as the Dom0 does not exhibit this behaviour.
>
> Hence some more details about the system:
>
> * cksum --debug says: "cksum: using pclmul hardware support"
> * amd64-microcode on the Dom0 is at 3.20191218.1

cksum since v9.0 checks at runtime whether pclmul is supported.
It seems that check is not working appropriately on a Xen DomU.
The routine in question is pclmul_supported() at:
https://github.com/coreutils/coreutils/blob/b841f111/src/cksum.c#L160-L191

That either suggests xen is incorrectly setting PCLMUL and AVX bits,
or perhaps these two bits are not sufficient.
Hmm I wonder do we also need to explicitly check for SSSE3 support?

I.e. I wonder does cksum built with the following help?
BTW it would be worth checking if ssse3 is mentioned in /proc/cpuinfo also.
If it was NOT, then there would be more of a chance of that change helping.

diff --git a/src/cksum.c b/src/cksum.c
index 85afab0ac..98733dadf 100644
--- a/src/cksum.c
+++ b/src/cksum.c
@@ -172,7 +172,7 @@ pclmul_supported (void)
       return false;
     }

-  if (! (ecx & bit_PCLMUL) || ! (ecx & bit_AVX))
+  if (! (ecx & bit_PCLMUL) || ! (ecx & bit_AVX) || ! (ecx & bit_SSSE3))
     {
       if (cksum_debug)
         error (0, 0, "%s", _("pclmul support not detected"));
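The /proc/cpuinfo check suggested above can be done with grep, but for completeness here is a small C sketch of the same lookup, assuming Linux. line_has_flag and cpuinfo_has_flag are hypothetical helpers, not coreutils functions. The match is on whole words, so asking for "sse3" does not accidentally match "ssse3".

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: true if FLAG occurs in LINE as a whole
   whitespace-delimited word.  */
bool
line_has_flag (const char *line, const char *flag)
{
  size_t len = strlen (flag);
  if (len == 0)
    return false;

  for (const char *p = strstr (line, flag); p; p = strstr (p + 1, flag))
    {
      bool starts = (p == line || p[-1] == ' ' || p[-1] == '\t');
      bool ends = (p[len] == '\0' || p[len] == ' '
                   || p[len] == '\t' || p[len] == '\n');
      if (starts && ends)
        return true;
    }
  return false;
}

/* Hypothetical helper: 1 if a "flags" line in /proc/cpuinfo lists
   FLAG, 0 if not, -1 if the file cannot be opened.  Assumes each
   flags line fits in the buffer below.  */
int
cpuinfo_has_flag (const char *flag)
{
  FILE *fp = fopen ("/proc/cpuinfo", "r");
  if (! fp)
    return -1;

  char line[8192];
  int found = 0;
  while (! found && fgets (line, sizeof line, fp))
    if (strncmp (line, "flags", 5) == 0 && line_has_flag (line, flag))
      found = 1;

  fclose (fp);
  return found;
}
```

Calling cpuinfo_has_flag ("ssse3") inside an affected DomU and on the Dom0 would show whether the two kernels see the same flag.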
Bug#1037264: cksum crashes intermittently with "Illegal instruction" on some Xen DomU
Package: coreutils Version: 9.1-1 Severity: important X-Debbugs-Cc: a...@debian.org Control: affects -1 aptitude-robot On a Xen DomU running Debian 12, cksum intermittently crashes as follows: # while :; do dd if=/dev/urandom count=1 2> /dev/null | cksum ; done 1758277878 512 2101634611 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction 2704754638 512 Illegal instruction 4028135672 512 2625667858 512 Illegal instruction Illegal instruction Illegal instruction 3923394050 512 3125973555 512 Illegal instruction Illegal instruction Illegal instruction 4259853375 512 Illegal instruction Illegal instruction 81698826 512 Illegal instruction 3571110616 512 Illegal instruction 1587881588 512 Illegal instruction Illegal instruction Illegal instruction 2814380057 512 Illegal instruction Illegal instruction 2944809052 512 Illegal instruction 2902358677 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction 935279575 512 Illegal instruction 456315694 512 Illegal instruction 469377998 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction 2550807941 512 Illegal instruction 3392916458 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction 2092884162 512 Illegal instruction 3196356363 512 Illegal instruction 1701279083 512 Illegal instruction 1118990197 512 Illegal instruction 1455432166 512 Illegal instruction Illegal instruction 
3772213637 512 Illegal instruction 3359021443 512 Illegal instruction 1472208906 512 Illegal instruction Illegal instruction Illegal instruction 530110239 512 1124879907 512 Illegal instruction 2364080335 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction 1306677535 512 Illegal instruction 2367703624 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction 3730416712 512 Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction 
Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction Illegal instruction 265751591 512 3833668362 512 Illegal instruction Illegal instruction 1086945333 512 Illegal instruction Illegal instruction 3420907443 512 Illegal instruction Illegal instruction Illegal instruction […] I was only able to reproduce this on a single host so far, hence no RC severity. (But feel free to bump to RC. :-) I tried and could NOT reproduce it on: * Debian 11 amd64 on real hardware (Intel(R) Core(TM) i7-6700 CPU; AMD EPYC 7313P 16-Core Processor; many more) * Debian 12 amd64 on real hardware (Intel(R) Core(TM) i7-6700T CPU; AMD EPYC 7742