Re: libc recently more aggressive about pthread locks in stable ?
On 12/11/16 at 18:51 -0200, Henrique de Moraes Holschuh wrote:
> Lucas,
>
> Thanks for trying a build run with TSX enabled.
>
> On Sat, 12 Nov 2016, Lucas Nussbaum wrote:
> > I did an archive rebuild on Amazon EC2 using m4.16xlarge instances, that
> > use a CPU with TSX enabled.
>
> What microcode revision is that Xeon E5-2686 running?

microcode: CPU0 sig=0x406f1, pf=0x1, revision=0xb14

(That's just on one node. I'm assuming that all nodes had the same
microcode revision, which is probably a reasonable bet)

Lucas
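As an aside for readers of the log line above: the sig= value is the raw CPUID signature, and the family/model/stepping that blacklists are usually keyed on can be recovered from it with a few shifts. The sketch below assumes the standard x86 layout of CPUID leaf 1 EAX; the struct and function names are made up for illustration and come from neither glibc nor the kernel.

```c
#include <stdint.h>

/* Hypothetical helper type: decoded fields of a CPUID signature. */
struct cpu_sig {
    unsigned family;
    unsigned model;
    unsigned stepping;
};

/* Decode a CPUID leaf 1 EAX value (the "sig=" field of the kernel's
 * microcode log line). Layout: stepping in bits 3:0, model in bits 7:4,
 * family in bits 11:8, extended model in bits 19:16, extended family in
 * bits 27:20. */
struct cpu_sig decode_cpu_sig(uint32_t sig)
{
    struct cpu_sig out;
    unsigned base_family = (sig >> 8) & 0xf;
    unsigned base_model  = (sig >> 4) & 0xf;
    unsigned ext_model   = (sig >> 16) & 0xf;
    unsigned ext_family  = (sig >> 20) & 0xff;

    out.stepping = sig & 0xf;
    /* The extended family is only added when the base family is 0xf. */
    out.family = (base_family == 0xf) ? base_family + ext_family
                                      : base_family;
    /* The extended model is only used for base families 6 and 0xf. */
    out.model = (base_family == 0x6 || base_family == 0xf)
                    ? (ext_model << 4) | base_model
                    : base_model;
    return out;
}
```

For sig=0x406f1 this gives family 6, model 0x4f, stepping 1, which is the triple one would then compare against the per-model entries of a TSX blacklist.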
Re: libc recently more aggressive about pthread locks in stable ?
Lucas,

Thanks for trying a build run with TSX enabled.

On Sat, 12 Nov 2016, Lucas Nussbaum wrote:
> I did an archive rebuild on Amazon EC2 using m4.16xlarge instances, that
> use a CPU with TSX enabled.

What microcode revision is that Xeon E5-2686 running?

> I've filed bugs for the packages that failed during that rebuild, but
> don't fail on m4.large instances:
> https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=qa-ftbfs-2016;users=debian...@lists.debian.org

We still need that instrumented libc if one is to test applications,
though, as most packages have little in the way of automated regression
test suites. And people need to test the packages (using the
applications) with such an instrumented libc installed (or running on a
box with TSX active).

--
Henrique Holschuh
Re: libc recently more aggressive about pthread locks in stable ?
On 07/11/16 at 21:52 +0100, Lucas Nussbaum wrote:
> Hi,
>
> On 06/11/16 at 17:41 -0200, Henrique de Moraes Holschuh wrote:
> > On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > > It's worth noting that TSX is broken in 'Haswell' processors and is
> > > supposed to be disabled via a microcode update. I don't know whether
> > > glibc avoids using it on these processors if the microcode update is
> > > not applied. (Linux doesn't appear to hide the feature flags.)
> >
> > It does avoid it. For glibc libpthreads, Debian has blacklisted Intel
> > TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> >
> > But anything else *will* attempt to use it, people query cpuid directly
> > for these things. You need a hypervisor that filters cpuid().
>
> How can one know what glibc does on a given CPU? (preferably without
> access to the hardware)
>
> I could try to run an archive rebuild on hardware where glibc leverages
> TSX to see what happens.

I did an archive rebuild on Amazon EC2 using m4.16xlarge instances, that
use a CPU with TSX enabled.

I've filed bugs for the packages that failed during that rebuild, but
don't fail on m4.large instances:
https://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=qa-ftbfs-2016;users=debian...@lists.debian.org

It's not impossible that some of them are caused by problems with
building in parallel, unrelated to TSX.

L.
Re: libc recently more aggressive about pthread locks in stable ?
On Mon, 07 Nov 2016, Lucas Nussbaum wrote:
> On 06/11/16 at 17:41 -0200, Henrique de Moraes Holschuh wrote:
> > On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > > It's worth noting that TSX is broken in 'Haswell' processors and is
> > > supposed to be disabled via a microcode update. I don't know whether
> > > glibc avoids using it on these processors if the microcode update is
> > > not applied. (Linux doesn't appear to hide the feature flags.)
> >
> > It does avoid it. For glibc libpthreads, Debian has blacklisted Intel
> > TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> >
> > But anything else *will* attempt to use it, people query cpuid directly
> > for these things. You need a hypervisor that filters cpuid().
>
> How can one know what glibc does on a given CPU? (preferably without
> access to the hardware)
>
> I could try to run an archive rebuild on hardware where glibc leverages
> TSX to see what happens.

IMHO it would be better to instrument the locks in glibc with asserts,
instead. You could use anything to test for pthread API violations,
then.

That said, if you are going to test Intel TSX for real, you need a
desktop Skylake-based Core i5/i7 or Xeon E3v5 that reports "RTM" in
/proc/cpuinfo. Some won't. Not every Skylake model will have it enabled
in the first place, and apparently the firmware can (and some _do_)
disable it, especially on the mobile side.

Please ensure the Skylake firmware has microcode 0x9d/0x9e or later, or
install the latest version of the non-free intel-microcode package. The
risk of unpredictable behaviour is quite real otherwise, and could mess
up the test results (and corrupt data).

Skylake errata are a nightmare. Note the AVX, AVX2, eDRAM (L4?), and TSX
ones, as well as the power-management ones:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v5-spec-update.pdf
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf

Don't attempt to test TSX with perf or Intel PT running. perf is likely
to cause too many aborts, and Intel PT is errata hell.

As for Broadwell, I don't know which processors would still have TSX
enabled in the first place when running the latest microcode, and we
blacklist most of them in glibc anyway (because almost all Broadwell-*
specification updates list it as either unavailable or unusable), so
they're not a very viable option to test this.

--
Henrique Holschuh
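The "reports RTM in /proc/cpuinfo" check mentioned above can be sketched as a whole-word search on the flags line. This is only an illustration: the function name is invented here, Linux prints the flag in lower case ("rtm"), and a positive result only says the kernel sees the feature, not that glibc will actually elide locks on that CPU, since the Debian blacklist is applied separately.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if `flag` appears as a whole whitespace-separated word in
 * `line`, for example the "flags : ..." line read from /proc/cpuinfo.
 * Hypothetical helper, written for illustration only. */
bool cpuinfo_line_has_flag(const char *line, const char *flag)
{
    size_t flag_len = strlen(flag);
    const char *p = line;

    if (flag_len == 0)
        return false;

    while (*p != '\0') {
        /* Skip the separators between words. */
        p += strspn(p, " \t\n");
        /* Measure the current word and compare it as a whole, so that
         * "rtm" does not match inside a longer flag name. */
        size_t word_len = strcspn(p, " \t\n");
        if (word_len == flag_len && strncmp(p, flag, flag_len) == 0)
            return true;
        p += word_len;
    }
    return false;
}
```

A caller would read /proc/cpuinfo line by line and pass each line that starts with "flags" to this function together with "rtm" (or "hle").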
Re: libc recently more aggressive about pthread locks in stable ?
On 07/11/16 at 21:52 +0100, Lucas Nussbaum wrote:
> Hi,
>
> On 06/11/16 at 17:41 -0200, Henrique de Moraes Holschuh wrote:
> > On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > > It's worth noting that TSX is broken in 'Haswell' processors and is
> > > supposed to be disabled via a microcode update. I don't know whether
> > > glibc avoids using it on these processors if the microcode update is
> > > not applied. (Linux doesn't appear to hide the feature flags.)
> >
> > It does avoid it. For glibc libpthreads, Debian has blacklisted Intel
> > TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> >
> > But anything else *will* attempt to use it, people query cpuid directly
> > for these things. You need a hypervisor that filters cpuid().
>
> How can one know what glibc does on a given CPU? (preferably without
> access to the hardware)

Answering myself, the relevant patch is
https://sources.debian.net/src/glibc/2.24-5/debian/patches/amd64/local-blacklist-for-Intel-TSX.diff/

Lucas
Re: libc recently more aggressive about pthread locks in stable ?
Hi,

On 06/11/16 at 17:41 -0200, Henrique de Moraes Holschuh wrote:
> On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > It's worth noting that TSX is broken in 'Haswell' processors and is
> > supposed to be disabled via a microcode update. I don't know whether
> > glibc avoids using it on these processors if the microcode update is
> > not applied. (Linux doesn't appear to hide the feature flags.)
>
> It does avoid it. For glibc libpthreads, Debian has blacklisted Intel
> TSX use [in libpthreads] on all of Haswell and much of Broadwell.
>
> But anything else *will* attempt to use it, people query cpuid directly
> for these things. You need a hypervisor that filters cpuid().

How can one know what glibc does on a given CPU? (preferably without
access to the hardware)

I could try to run an archive rebuild on hardware where glibc leverages
TSX to see what happens.

Lucas
Re: libc recently more aggressive about pthread locks in stable ?
On Sun, 06 Nov 2016, Adrian Bunk wrote:
> On Sun, Nov 06, 2016 at 05:41:34PM -0200, Henrique de Moraes Holschuh wrote:
> > On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > > It's worth noting that TSX is broken in 'Haswell' processors and is
> > > supposed to be disabled via a microcode update. I don't know whether
> > > glibc avoids using it on these processors if the microcode update is
> > > not applied. (Linux doesn't appear to hide the feature flags.)
> >
> > It does avoid it. For glibc libpthreads, Debian has blacklisted Intel
> > TSX use [in libpthreads] on all of Haswell and much of Broadwell.
> >
> > But anything else *will* attempt to use it, people query cpuid directly
> > for these things. You need a hypervisor that filters cpuid().
>
> All users who are using intel-microcode from non-free instead of running
> outdated microcode with known errata should be OK here?

Last time I checked, it looked like a yes for Skylake as far as Intel
TSX is concerned. I don't know about the other processors, such as
Broadwell-E.

--
Henrique Holschuh
Re: libc recently more aggressive about pthread locks in stable ?
On 2016-11-06 01:12, Henrique de Moraes Holschuh wrote:
> On Sat, 05 Nov 2016, Ian Jackson wrote:
> > Looking at the code, I think that gs in jessie is plainly violating
> > the rules about the use of pthread locks. On my partner's machine,
>
> Per logs from message #15 on bug #842796:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15
>
> SIGSEGV on __lll_unlock_elision is a signature (IME with very high
> confidence) of an attempt to unlock an already unlocked lock while
> running under hardware lock elision.
>
> Well, unlocking an already unlocked lock is a pthreads API rule
> violation, and it is going to crash the process on something that
> implements hardware lock elision.
>
> These would be Intel x86 processors with TSX enabled[1] for Debian
> 8/jessie. For Debian 9/stretch and for unstable, I believe it also
> includes IBM Power8, and s390x systems -- AFAIK they won't forgive an
> attempt to unlock an unlocked lock any more than Intel TSX does.
>
> [1] Broadwell-E, Skylake, and later processors, as well as Xeon *v5
> processors. I am not sure if we blacklisted any of the Xeon *v4
> or not, and too tired to look their model numbers up right now.
>
> Unfortunately, when hardware lock elision support was added to glibc
> upstream, libpthreads was *not* changed to properly assert() this
> forbidden condition on the non-hardware-elision codepaths. Such an
> assert() would have given us consistent behavior, thus flushing the bugs
> out in the open... at the cost of a performance hit (I have no idea how
> severe), and much screaming.

This has not been done as it would have a severe performance hit. That
said, error-checking mutexes also exist in glibc, and have been designed
exactly for that, i.e. they trade performance for correctness.

> To be fair: it is likely nobody upstream had any idea of just how much
> code got libpthreads usage wrong... and we certainly didn't know better
> in Debian, either. Well, now we're going to find out :-(
>
> BTW, AFAIK libpthreads still doesn't have any such assert(), so there's
> likely a lot of such buggy code in unstable still. This is going to
> cause trouble for Debian stretch, too.

I don't expect it to be worse than jessie, actually probably better as
some of the bugs have been fixed by the various upstreams in the
meantime.

Also remember that TSX is just making the bug more visible. It means
that users without TSX might experience hangs instead. There are
actually two "hang bugs" reported against ghostscript that could be
fixed by fixing the TSX bug.

[...]

> If the problem is too widespread and too hard to fix on a large number
> of packages, I suppose we could ask the glibc maintainers to consider
> disabling hardware lock elision support in stable through a stable
> update.
>
> Such a change to glibc would likely require some patches to ensure it
> *really* disabled Intel TSX opcode/instruction insertion, but I think we
> already ship all of them as part of the Intel TSX blacklist. The result
> would need real-world testing on an up-to-date Skylake box as well as
> objdump inspection to ensure *no* TSX-related instructions leaked into
> the binaries.

We can disable lock elision by building glibc without
"--enable-lock-elision". There is no risk that the instructions are
leaked into the binaries, except of course for static binaries.

That said, so far we are talking about a few packages only. A lot of
bugs have already been fixed during the jessie release cycle, I remember
sending patches for that.

> And what should we do about Debian stretch, then?

As said above, disabling TSX in glibc is just hiding issues from users.
We should instead try to detect as many bugs as possible (possibly
fixing the corresponding bugs in jessie). One way would be to get a box
with TSX instructions and use it for the reproducible builds and/or the
autopkgtests.

Aurelien

--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
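To make the error-checking mutexes mentioned above concrete, here is a minimal sketch of the double unlock discussed in this thread, performed on a mutex of type PTHREAD_MUTEX_ERRORCHECK. Only standard POSIX calls are used; the function name is invented for illustration, and this is not code taken from any of the affected packages.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* expose the mutex type constants under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Lock and unlock an error-checking mutex, then unlock it a second time.
 * Returns the result of the second, invalid unlock. */
int double_unlock_errorcheck(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&mutex);
    assert(rc == 0);
    rc = pthread_mutex_unlock(&mutex);
    assert(rc == 0);

    /* The mutex is no longer held: an error-checking mutex reports this
     * instead of leaving the behaviour undefined. */
    rc = pthread_mutex_unlock(&mutex);

    pthread_mutex_destroy(&mutex);
    return rc;
}
```

With this mutex type the second unlock returns EPERM, so a caller that checks return values sees the API violation on any hardware. With a default mutex the same sequence is undefined behaviour, which is why it can pass silently on one machine and crash in __lll_unlock_elision on another. Build with -pthread.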
Re: libc recently more aggressive about pthread locks in stable ?
On Sun, Nov 06, 2016 at 05:41:34PM -0200, Henrique de Moraes Holschuh wrote:
> On Sun, 06 Nov 2016, Ben Hutchings wrote:
> > It's worth noting that TSX is broken in 'Haswell' processors and is
> > supposed to be disabled via a microcode update. I don't know whether
> > glibc avoids using it on these processors if the microcode update is
> > not applied. (Linux doesn't appear to hide the feature flags.)
>
> It does avoid it. For glibc libpthreads, Debian has blacklisted Intel
> TSX use [in libpthreads] on all of Haswell and much of Broadwell.
>
> But anything else *will* attempt to use it, people query cpuid directly
> for these things. You need a hypervisor that filters cpuid().

All users who are using intel-microcode from non-free instead of running
outdated microcode with known errata should be OK here?

Running outdated microcode is a bad idea, and no one is making
Debian-specific workarounds for all the other CPU errata.

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
Re: libc recently more aggressive about pthread locks in stable ?
Henrique de Moraes Holschuh writes ("Re: libc recently more aggressive about pthread locks in stable ?"):
> Per logs from message #15 on bug #842796:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15
>
> SIGSEGV on __lll_unlock_elision is a signature (IME with very high
> confidence) of an attempt to unlock an already unlocked lock while
> running under hardware lock elision.

I don't know anything about hardware lock elision...

> Well, unlocking an already unlocked lock is a pthreads API rule
> violation, and it is going to crash the process on something that
> implements hardware lock elision.

... but you are of course correct about this.

I debugged the problem with ghostscript, and it was indeed violating the
pthreads rules. I have filed #843324 with a patch for Debian to backport
the corresponding upstream fix.

I don't understand the wider logic in ghostscript; the bug was in the
colour space management code and occurred when a function was called
with two pointer arguments which were actually aliases of the same
colourspace-related data structure. Converting ghostscript to use
recursive mutexes was IMO clearly correct and fixed the bug.

> If the problem is too widespread and too hard to fix on a large number
> of packages, I suppose we could ask the glibc maintainers to consider
> disabling hardware lock elision support in stable through a stable
> update.

I think this would be a good idea. ogg123 and ghostscript are hardly
obscure programs. It's difficult to know how bad this problem is, but we
would like stable to be useful even on recent hardware.

> And what should we do about Debian stretch, then?

Perhaps we could add the assert you suggest, on non-lock-elision
hardware. Whether to do that would depend on its performance impact.

TBH I wonder whether we really want to be giving an evidently shonky
codebase boobytrapped mutexes by default. We could change the default
mutex type to recursive and make all of these bugs go away.

Ian.

--
Ian Jackson <ijack...@chiark.greenend.org.uk> These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.
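The aliasing situation described above can be illustrated with a recursive mutex. The sketch below is purely hypothetical: the struct, its field and both functions are invented for this example and do not reflect ghostscript's actual data structures or its upstream fix. It only shows why a recursive mutex tolerates a function that locks both of its arguments when they point at the same object.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* expose the mutex type constants under strict -std= modes */
#endif
#include <pthread.h>

/* Hypothetical shared object guarded by its own lock. */
struct colour_space {
    pthread_mutex_t lock;
    int use_count;
};

/* Initialise the object with a recursive mutex. Returns 0 on success. */
int colour_space_init(struct colour_space *cs)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutex_init(&cs->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    cs->use_count = 0;
    return rc;
}

/* Lock both arguments, update both, unlock both. When a == b the second
 * lock succeeds only because the mutex is recursive; with a default
 * mutex the same call would self-deadlock or be undefined. */
int touch_pair(struct colour_space *a, struct colour_space *b)
{
    int rc = pthread_mutex_lock(&a->lock);
    if (rc != 0)
        return rc;
    rc = pthread_mutex_lock(&b->lock);
    if (rc != 0) {
        pthread_mutex_unlock(&a->lock);
        return rc;
    }
    a->use_count++;
    b->use_count++;
    pthread_mutex_unlock(&b->lock);
    pthread_mutex_unlock(&a->lock);
    return 0;
}
```

Calling touch_pair(&cs, &cs) on one initialised object returns 0 and leaves use_count at 2, with each lock matched by exactly one unlock. The sketch deliberately ignores lock ordering between distinct objects across threads, which a real fix would also have to consider.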
Re: libc recently more aggressive about pthread locks in stable ?
On Sun, 06 Nov 2016, Ben Hutchings wrote:
> It's worth noting that TSX is broken in 'Haswell' processors and is
> supposed to be disabled via a microcode update. I don't know whether
> glibc avoids using it on these processors if the microcode update is
> not applied. (Linux doesn't appear to hide the feature flags.)

It does avoid it. For glibc libpthreads, Debian has blacklisted Intel
TSX use [in libpthreads] on all of Haswell and much of Broadwell.

But anything else *will* attempt to use it, people query cpuid directly
for these things. You need a hypervisor that filters cpuid().

--
Henrique Holschuh
Re: libc recently more aggressive about pthread locks in stable ?
[resending with correct Cc:] I believe that similar bugs have been afflicting hurd and kfreebsd debian ports for some time. In retrospect, it's too bad these reports weren't given more attention, because it could have made things better for Linux platforms as well. :-/ see e.g., https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=671785#48 Jeff
Re: libc recently more aggressive about pthread locks in stable ?
On Sat, 2016-11-05 at 20:32 +0100, Christian Seiler wrote:
> On 11/05/2016 08:13 PM, Ian Jackson wrote:
> > I have just been debugging a ghostscript segfault on jessie amd64.
> >
> > Looking at the code, I think that gs in jessie is plainly violating
> > the rules about the use of pthread locks. On my partner's machine,
> > this makes it segfault on termination (with some input files, at
> > least). On my machine it works just fine. The code in sid is better.
> >
> > I recently encountered what seems to be a similar bug in ogg123 in
> > stable. #842796.
> >
> > Has something changed in jessie's libc recently ? I find it difficult
> > to imagine that these bugs would have been missed earlier during the
> > life of jessie.
>
> Recently Frank Fegert discovered a problem with locking in open-iscsi
> that only occurs on new hardware. The code previously was wrong, but
> earlier CPUs were more forgiving when it came to this error and it
> couldn't be triggered.
>
> Frank wrote about the problem in his blog in great detail:
> http://www.bityard.org/blog/2016/08/05/debugging_segfaults_open-iscsi_iscsiuio_intel_broadwell
[...]

This is not really a case of older CPUs being 'more forgiving'; they had
no locking operations[*] and nothing to forgive. However, glibc uses
transactional memory (TSX) on the newer CPUs that implement it, and that
new code does result in the CPU detecting some locking errors.

It's worth noting that TSX is broken in 'Haswell' processors and is
supposed to be disabled via a microcode update. I don't know whether
glibc avoids using it on these processors if the microcode update is
not applied. (Linux doesn't appear to hide the feature flags.)

* The LOCK prefix is for 'bus locking' during a single instruction,
i.e. making it atomic. The CPU can't know what higher-level operation
it's being used for.

Ben.

--
Ben Hutchings
The world is coming to an end. Please log off.
Re: libc recently more aggressive about pthread locks in stable ?
On Sat, 05 Nov 2016, Ian Jackson wrote:
> Looking at the code, I think that gs in jessie is plainly violating
> the rules about the use of pthread locks. On my partner's machine,

Per logs from message #15 on bug #842796:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15

SIGSEGV on __lll_unlock_elision is a signature (IME with very high
confidence) of an attempt to unlock an already unlocked lock while
running under hardware lock elision.

Well, unlocking an already unlocked lock is a pthreads API rule
violation, and it is going to crash the process on something that
implements hardware lock elision.

These would be Intel x86 processors with TSX enabled[1] for Debian
8/jessie. For Debian 9/stretch and for unstable, I believe it also
includes IBM Power8, and s390x systems -- AFAIK they won't forgive an
attempt to unlock an unlocked lock any more than Intel TSX does.

[1] Broadwell-E, Skylake, and later processors, as well as Xeon *v5
processors. I am not sure if we blacklisted any of the Xeon *v4
or not, and too tired to look their model numbers up right now.

Unfortunately, when hardware lock elision support was added to glibc
upstream, libpthreads was *not* changed to properly assert() this
forbidden condition on the non-hardware-elision codepaths. Such an
assert() would have given us consistent behavior, thus flushing the bugs
out in the open... at the cost of a performance hit (I have no idea how
severe), and much screaming.

To be fair: it is likely nobody upstream had any idea of just how much
code got libpthreads usage wrong... and we certainly didn't know better
in Debian, either. Well, now we're going to find out :-(

BTW, AFAIK libpthreads still doesn't have any such assert(), so there's
likely a lot of such buggy code in unstable still. This is going to
cause trouble for Debian stretch, too.

> Has something changed in jessie's libc recently ? I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.

The required hardware was not widely available at the time, the
knowledge of how hardware lock elision would really behave was sparse
outside of Intel and IBM -- so people either didn't know, or did not
grasp the importance of the fact that the hardware would be utterly
intolerant of something that the old code was too lenient about -- and
libpthreads was not instrumented to compensate for that.

I actually recommended that it would be safer to disable lock elision
for jessie[2]: the sharp-cornered nature of the code in glibc 2.19
scared me, as did just how messed up the implementation on Intel
processors was at the time. Unfortunately, I didn't push for it at all:
I didn't know how correct I was at the time[3].

[2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195#50

The hard truth is that nobody in Debian knew how deep those murky waters
were at the time[3], and I don't think glibc upstream developers did
either. So, we limited ourselves in Debian to blacklisting the
processors where Intel (either for sure, or highly likely) screwed it up
beyond repair.

[3] A number of subtle Intel TSX errata were fixed by Skylake and
Broadwell microcode updates, and the latest ones are quite recent.
Meanwhile, the until-then latent (or subtle) locking bugs in
applications and libraries are becoming frequent crashers as more users
get newer computers.

Anyway, any library or application that hits this issue has broken
locking, plain and simple. A package crashing from this issue very
likely requires a stable update to fix the locking (which won't always
be a trivial fix, either), even if we changed libpthreads to disable
lock elision support and it stopped the crashes -- even if it wouldn't
crash anymore, the locking would still be broken and therefore suspected
of not being as effective as it would have to be to ensure correct
operation at all times.

> I will try to make a patch to fix ghostscript, or at least file a
> proper bug. But, if there was a libc change, would it be possible to
> revert it or make some kind of workaround ?

If the problem is too widespread and too hard to fix on a large number
of packages, I suppose we could ask the glibc maintainers to consider
disabling hardware lock elision support in stable through a stable
update.

Such a change to glibc would likely require some patches to ensure it
*really* disabled Intel TSX opcode/instruction insertion, but I think we
already ship all of them as part of the Intel TSX blacklist. The result
would need real-world testing on an up-to-date Skylake box as well as
objdump inspection to ensure *no* TSX-related instructions leaked into
the binaries.

And what should we do about Debian stretch, then?

Some references:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=824191
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800574
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=762195

--
Henrique Holschuh
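The objdump inspection suggested above amounts to searching a disassembly for the TSX mnemonics. A rough sketch of the per-line check follows, assuming the usual "address: bytes mnemonic operands" layout of objdump -d output; the function name is invented here, and the list covers the RTM instructions (xbegin, xend, xabort, xtest) plus the HLE prefixes (xacquire, xrelease).

```c
#include <stdbool.h>
#include <string.h>

/* Return true if one line of disassembly contains a TSX mnemonic as a
 * whole whitespace-separated word. Hypothetical helper for illustration;
 * it does not parse objdump output beyond splitting on whitespace. */
bool line_has_tsx_mnemonic(const char *line)
{
    static const char *const mnemonics[] = {
        "xbegin", "xend", "xabort", "xtest", /* RTM instructions */
        "xacquire", "xrelease",              /* HLE prefixes */
    };
    const char *p = line;

    while (*p != '\0') {
        /* Skip separators, then measure the current word. */
        p += strspn(p, " \t\n");
        size_t word_len = strcspn(p, " \t\n");
        for (size_t i = 0; i < sizeof mnemonics / sizeof mnemonics[0]; i++) {
            if (word_len == strlen(mnemonics[i]) &&
                strncmp(p, mnemonics[i], word_len) == 0)
                return true;
        }
        p += word_len;
    }
    return false;
}
```

Matching whole words keeps symbol names such as a label containing "xbegin" from counting as hits. A real audit would feed every line of the disassembly of each shipped binary through a check like this and would still need the testing on TSX-capable hardware described above.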
Re: libc recently more aggressive about pthread locks in stable ?
On 2016-11-05 19:13, Ian Jackson wrote:
> I have just been debugging a ghostscript segfault on jessie amd64.
>
> Looking at the code, I think that gs in jessie is plainly violating
> the rules about the use of pthread locks. On my partner's machine,
> this makes it segfault on termination (with some input files, at
> least). On my machine it works just fine. The code in sid is better.
>
> I recently encountered what seems to be a similar bug in ogg123 in
> stable. #842796.
>
> Has something changed in jessie's libc recently ? I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.

I think you just got a new machine with a CPU supporting the TSX
instructions, which are more picky about following the pthreads
semantics.

Unfortunately, given Intel's fuck-up of the TSX implementation in
Haswell and some Broadwell CPUs, they had to disable the TSX
instructions through firmware updates, which in turn means we haven't
had all packages in jessie tested by a wide set of people.

Aurelien

--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net
Re: libc recently more aggressive about pthread locks in stable ?
Ian Jackson writes ("libc recently more aggressive about pthread locks in stable ?"):
> I have just been debugging a ghostscript segfault on jessie amd64.
...
> I recently encountered what seems to be a similar bug in ogg123 in
> stable. #842796.
>
> Has something changed in jessie's libc recently ? I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.
>
> I will try to make a patch to fix ghostscript, or at least file a
> proper bug. But, if there was a libc change, would it be possible to
> revert it or make some kind of workaround ?

FYI, the ghostscript bug, with patch for jessie, is #843324. sid's
ghostscript is fine and I think stretch's is too.

Ian.

--
Ian Jackson These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.
Re: libc recently more aggressive about pthread locks in stable ?
On 11/05/2016 08:13 PM, Ian Jackson wrote:
> I have just been debugging a ghostscript segfault on jessie amd64.
>
> Looking at the code, I think that gs in jessie is plainly violating
> the rules about the use of pthread locks. On my partner's machine,
> this makes it segfault on termination (with some input files, at
> least). On my machine it works just fine. The code in sid is better.
>
> I recently encountered what seems to be a similar bug in ogg123 in
> stable. #842796.
>
> Has something changed in jessie's libc recently ? I find it difficult
> to imagine that these bugs would have been missed earlier during the
> life of jessie.

Recently Frank Fegert discovered a problem with locking in open-iscsi
that only occurs on new hardware. The code previously was wrong, but
earlier CPUs were more forgiving when it came to this error and it
couldn't be triggered.

Frank wrote about the problem in his blog in great detail:
http://www.bityard.org/blog/2016/08/05/debugging_segfaults_open-iscsi_iscsiuio_intel_broadwell

I haven't looked in detail at your problem, but I could easily imagine
that the problem you're experiencing with other packages is similar,
especially since you mentioned migrating to new hardware.

Hope that helps.

Regards,
Christian

PS: In case someone was wondering: the specific problem with open-iscsi
is now fixed in sid, testing and jessie-backports; jessie is not
affected because we didn't yet build the component with the issue there.