On 2016-11-06 01:12, Henrique de Moraes Holschuh wrote:
> On Sat, 05 Nov 2016, Ian Jackson wrote:
> > Looking at the code, I think that gs in jessie is plainly violating
> > the rules about the use of pthread locks.  On my partner's machine,
> 
> Per logs from message #15 on bug #842796:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=842796#15
> 
> SIGSEGV on __lll_unlock_elision is a signature (IME with very high
> confidence) of an attempt to unlock an already unlocked lock while
> running under hardware lock elision.
> 
> 
> Well, unlocking an already unlocked lock is a pthreads API rule
> violation, and it is going to crash the process on something that
> implements hardware lock elision.
> 
> These would be Intel x86 processors with TSX enabled[1] for Debian
> 8/jessie.  For Debian 9/stretch and for unstable, I believe it also
> includes IBM Power8, and s390x systems -- AFAIK they won't forgive an
> attempt to unlock an unlocked lock any more than Intel TSX does.
> 
> [1] Broadwell-E, Skylake, and later processors, as well as Xeon *v5
>     processors.  I am not sure if we blacklisted any of the Xeon *v4
>     or not, and too tired to look their model numbers up right now.
> 
> Unfortunately, when hardware lock elision support was added to glibc
> upstream, libpthreads was *not* changed to properly assert() this
> forbidden condition on the non-hardware-elision codepaths.  Such an
> assert() would have given us consistent behavior, thus flushing the bugs
> out in the open... at the cost of a performance hit (I have no idea how
> severe), and much screaming.

This has not been done, as it would cause a severe performance hit. That
said, error-checking mutexes also exist in glibc and have been designed
exactly for that, i.e. they trade performance for correctness.

> To be fair: it is likely nobody upstream had any idea of just how much
> code got libpthreads usage wrong... and we certainly didn't know better
> in Debian, either.  Well, now we're going to find out :-(
> 
> BTW, AFAIK libpthreads still doesn't have any such assert(), so there's
> likely a lot of such buggy code in unstable still.  This is going to
> cause trouble for Debian stretch, too.

I don't expect it to be worse than jessie; it will actually probably be
better, as some of the bugs have been fixed by the various upstreams in
the meantime. Also remember that TSX is just making the bug more
visible: users without TSX might experience hangs instead. There are
actually two "hang bugs" reported against ghostscript that could be
fixed by fixing the TSX bug.

[...]

> If the problem is too widespread and too hard to fix on a large number
> of packages, I suppose we could ask the glibc maintainers to consider
> disabling hardware lock elision support in stable through a stable
> update.
> 
> Such a change to glibc would likely require some patches to ensure it
> *really* disabled Intel TSX opcode/instruction insertion, but I think we
> already ship all of them as part of the Intel TSX blacklist.  The result
> would need real-world testing on an up-to-date Skylake box as well as
> objdump inspection to ensure *no* TSX-related instructions leaked into
> the binaries.

We can disable lock elision by no longer passing "--enable-lock-elision"
to configure. There is no risk of the instructions leaking into the
binaries, except of course for static binaries. That said, so far we are
talking about only a few packages. A lot of bugs were already fixed
during the jessie release cycle; I remember sending patches for that.

> And what should we do about Debian stretch, then?

As said above, disabling TSX in glibc just hides the issues from users.
We should instead try to detect as many bugs as possible (possibly
fixing the corresponding bugs in jessie as well). One way would be to
get a box with TSX instructions and use it for the reproducible builds
and/or the autopkgtests.
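
As a rough sketch of how a candidate build or test box could be checked,
the following looks for a CPU flag such as "rtm" (or "hle") on the
"flags" lines of /proc/cpuinfo. Both function names are made up for this
example, and the flag only tells us what the kernel reports for the CPU.
Whether glibc actually uses elision there is a separate question
(blacklists, configure options), which this does not answer:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `flag` appears in `line` as a
 * whole whitespace-separated word, 0 otherwise. */
int line_has_flag(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[len];
        int ends = (end == '\0' || end == ' ' || end == '\t' || end == '\n');

        if (starts && ends)
            return 1;
        p++;    /* partial match only, keep searching */
    }
    return 0;
}

/* Hypothetical helper: returns 1 if a "flags" line in the cpuinfo-style
 * file at `path` lists `flag`, 0 if not, and -1 if the file cannot be
 * opened. Assumes each flags line fits in the buffer. */
int cpuinfo_has_flag(const char *path, const char *flag)
{
    char buf[8192];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(buf, sizeof buf, f) != NULL) {
        if (strncmp(buf, "flags", 5) == 0 && line_has_flag(buf, flag)) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

On Linux x86, calling cpuinfo_has_flag("/proc/cpuinfo", "rtm") would be
the obvious use; a result of 1 suggests the machine is worth considering
for TSX-related test runs.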

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurel...@aurel32.net                 http://www.aurel32.net
