Re: TLSDESC clobber ABI stability/futureproofness?

Rich Felker Sat, 13 Oct 2018 19:21:50 -0700

On Sat, Oct 13, 2018 at 04:00:32AM -0300, Alexandre Oliva wrote:
> On Oct 11, 2018, Rich Felker <dal...@libc.org> wrote:
> 
> > However the only way to omit this path from TLSDESC is
> > installing the new TLS to all live threads at dlopen time
> 
> Well, one could just as easily drop dynamic TLS altogether, forcing all
> TLS into the Static TLS block until it fills up, and failing for good if
> it does.  But then, you don't need TLS Descriptors at all, just go with
> Initial Exec.  It helps if init can set the Static TLS block size from
> an environment variable or somesuch.


This is not at all equivalent; it creates a situation where you must
either significantly over-provision resources or have things fail, and
there's no reasonable way of configuring it (the environment is
certainly not the right mechanism, because how much should be reserved
varies by application).

On the other hand, installing dynamic TLS at dlopen time makes no
visible difference in behavior to the user. It's purely an
implementation detail. So "just as easily" is not a reasonable way to
describe it.

> But your statement appears to be conflating two separate issues, namely
> the allocation of a TLS Descriptor during lazy TLS resolution, and the
> allocation of per-thread dynamic TLS blocks upon first access.

For musl I'm not thinking about either of these, since we don't do
either. All dynamic TLS memory is allocated at dlopen time. It's only
claimed just-in-time by the thread needing it.
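
A minimal sketch of "allocated at dlopen time, claimed just-in-time",
assuming a pool with one block per live thread and an atomic counter
for handing blocks out. The names tls_pool and claim_block are
hypothetical and this is not musl's actual code; the point is only
that the claim path needs no malloc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical pool set up by dlopen: one block per live thread. */
struct tls_pool {
    unsigned char *base;   /* memory allocated up front by dlopen */
    size_t block_size;     /* size of one thread's TLS block */
    size_t nblocks;        /* number of blocks in the pool */
    atomic_size_t next;    /* index of the next unclaimed block */
};

/* Called by a thread on first access to the new TLS. No allocation
 * happens here; the thread just takes the next free block. Returns
 * NULL if the pool is exhausted, which a correctly sized pool should
 * never allow. */
static void *claim_block(struct tls_pool *p)
{
    assert(p->block_size > 0);
    size_t i = atomic_fetch_add_explicit(&p->next, 1,
                                         memory_order_relaxed);
    if (i >= p->nblocks)
        return NULL;
    return p->base + i * p->block_size;
}
```

Even this small path is C code, so it still depends on which
registers the compiler is allowed to clobber.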

Your two cases both potentially involve malloc, and thereby external
code not under libc/ldso control, that might clobber extra registers.
Mine doesn't, but still involves C code (because the logic to claim
new dynamic TLS is sufficiently complex that nobody wants to write it
in asm for each arch) and thereby depends on assumptions about which
registers compiler-generated code may clobber.

> The former is just as easy and harmless to disable as lazy function
> relocations.  The latter is not exclusive to TLSDesc: __tls_get_addr has
> historically used malloc to grow the DTV and to allocate dynamic TLS
> blocks, and if overriders to malloc end up using/clobbering unexpected
> registers, even if just because of lazy PLT resolution for calls in its
> implementation, things might go wrong.  Sure enough, __tls_get_addr
> doesn't use a specialized ABI, so this is usually not an issue.

Indeed.

> > That's actually not a bad idea -- it drops the compare/branch from the
> > dynamic tlsdesc code path, and likewise in __tls_get_addr, making both
> > forms of dynamic TLS (possibly considerably) faster.
> 
> But then you have to add some form of synchronization so that other
> threads can actually mess with your DTV without conflicts, from

If a thread has not synchronized with the dlopen that added new
dynamic TLS, there's no way it can access that DTV slot, so it doesn't
matter if it reads the old version or the new version of its DTV.
However, if it does read the new version of the DTV, it needs to see
the correct entries in it. You can rely on some sort of
hardware-guaranteed consume-order property for this, but my preference
is just SYS_membarrier or equivalent after setting up the new DTV but
before installing it. Then there is no synchronization requirement on
access.
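
A rough sketch of that ordering, under loud assumptions: thread_cb,
install_dtv, dtv_slot and heavy_barrier are hypothetical names, the
layout (slot count in dtv[0]) is only illustrative, and
heavy_barrier is a placeholder where a real implementation would
issue SYS_membarrier or an equivalent. A plain fence on the writer
side alone does not give the cross-thread guarantee.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-thread control block. dtv[0] holds the slot
 * count; dtv[1..count] point to TLS blocks. */
struct thread_cb {
    _Atomic(void **) dtv;
};

/* Placeholder only: stands in for SYS_membarrier or equivalent. */
static void heavy_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Run by the dlopen thread for each live thread t. Builds a DTV
 * with one more slot, makes its contents visible, then installs it.
 * The old DTV is returned through old_out and must not be freed
 * here, since its owner may still be reading it. */
static int install_dtv(struct thread_cb *t, size_t new_cnt,
                       void *new_block, void ***old_out)
{
    void **old = atomic_load_explicit(&t->dtv, memory_order_relaxed);
    size_t old_cnt = old ? (size_t)(uintptr_t)old[0] : 0;
    assert(new_cnt == old_cnt + 1);

    void **dtv = calloc(new_cnt + 1, sizeof *dtv);
    if (!dtv)
        return -1;

    dtv[0] = (void *)(uintptr_t)new_cnt;
    if (old_cnt)
        memcpy(dtv + 1, old + 1, old_cnt * sizeof *dtv);
    dtv[new_cnt] = new_block;

    heavy_barrier();                 /* after setup, before install */
    atomic_store_explicit(&t->dtv, dtv, memory_order_relaxed);

    *old_out = old;
    return 0;
}

/* Access path: a plain load and index, with no compare/branch and
 * no synchronization. */
static void *dtv_slot(struct thread_cb *t, size_t i)
{
    void **dtv = atomic_load_explicit(&t->dtv, memory_order_relaxed);
    return dtv[i];
}
```

A thread that still sees the old DTV cannot legitimately ask for the
new slot, so either version is fine for it.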

> releasing dlclose()d dynamic blocks to growing the DTV and releasing the
> old DTV while its owner thread is using it.

Likewise if a DSO is being unloaded, the thread calling dlclose must
have synchronized with any thread that could be using functions/data
from the DSO to ensure that they do not access them again, so there is
no restriction on freeing dynamic TLS blocks. There is a restriction
on freeing old DTVs, of course; if you're going to free these, I'm
pretty sure the right way to do it is to have the thread itself free
them at exit. Anything else seems to impose ridiculous synchronization
costs for no measurable gain.
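
One way this could look, as a sketch only: replaced DTVs are queued
on a per-thread list and the owning thread frees them when it exits.
The names thread_cb, retire_dtv and free_retired_dtvs are
hypothetical, and the code assumes the caller already holds whatever
lock serializes dlopen against thread exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* List node recording one replaced DTV. */
struct retired {
    struct retired *next;
    void **dtv;
};

/* Hypothetical per-thread control block. */
struct thread_cb {
    void **dtv;               /* currently installed DTV */
    struct retired *retired;  /* old DTVs awaiting thread exit */
};

/* Called by the dlopen thread after installing a new DTV for t.
 * The old one is queued, not freed, since t may still be reading
 * it. Assumes the (hypothetical) loader lock is held. */
static int retire_dtv(struct thread_cb *t, void **old)
{
    assert(old != NULL);
    struct retired *r = malloc(sizeof *r);
    if (!r)
        return -1;
    r->dtv = old;
    r->next = t->retired;
    t->retired = r;
    return 0;
}

/* Called by thread t itself at exit, when it can no longer be using
 * any old DTV. Returns the number of DTVs freed. */
static size_t free_retired_dtvs(struct thread_cb *t)
{
    size_t n = 0;
    while (t->retired) {
        struct retired *r = t->retired;
        t->retired = r->next;
        free(r->dtv);
        free(r);
        n++;
    }
    return n;
}
```

The cost is that old DTVs stay around for the lifetime of the thread,
which is bounded by the number of dlopen calls that grew the DTV.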

> I wonder if it would make sense to introduce an overridable
> call-clobbered-regs-preserving wrapper function, that lazy PLT resolvers
> and Dynamic TLSDesc calls would call, and that could be easily extended
> to preserve extended register files without having to modify the library
> proper.  LD_PRELOAD could bring it in, and it could even use ifunc
> relocations, to be able to cover all available registers on arches with
> multiple register file extensions.

The right way to do this would be to have the kernel provide it via
vdso; any other approach is going to fail to cover certain
configurations. Assuming kernel versions >N promised to do this, ldso
could just hard-code a "pre-vdso-register-file-save" version of the
function that would cover all registers supported by kernels too old
to have the vdso function, and use it when the vdso one is missing.
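
The selection logic could be sketched roughly as follows. Everything
here is hypothetical: no such vdso function exists today, the symbol
name is made up, and the lookup callback stands in for however ldso
would search the vdso. The two lookup_* functions and
fake_vdso_regsave only simulate a new and an old kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for a register-file save routine. */
typedef void (*regsave_fn)(unsigned char *buf);

/* Hypothetical vdso lookup: returns NULL if the symbol is absent. */
typedef regsave_fn (*vdso_lookup_fn)(const char *name);

/* Hard-coded fallback covering the registers known to kernels too
 * old to provide the vdso function. Here it only marks the buffer. */
static void builtin_regsave(unsigned char *buf)
{
    assert(buf != NULL);
    buf[0] = 'B';
}

/* Prefer the kernel-provided routine, else use the built-in one. */
static regsave_fn pick_regsave(vdso_lookup_fn lookup)
{
    regsave_fn f = lookup ? lookup("hypothetical_vdso_regsave") : NULL;
    return f ? f : builtin_regsave;
}

/* Stand-ins used only to exercise the selection above. */
static void fake_vdso_regsave(unsigned char *buf)
{
    assert(buf != NULL);
    buf[0] = 'V';
}

static regsave_fn lookup_new_kernel(const char *name)
{
    (void)name;
    return fake_vdso_regsave;
}

static regsave_fn lookup_old_kernel(const char *name)
{
    (void)name;
    return NULL;
}
```

Calling pick_regsave(lookup_old_kernel) yields the built-in routine,
and pick_regsave(lookup_new_kernel) yields the kernel-provided one.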

Rich
