On Sat, Oct 13, 2018 at 04:00:32AM -0300, Alexandre Oliva wrote:
> On Oct 11, 2018, Rich Felker <dal...@libc.org> wrote:
>
> > However the only way to omit this path from TLSDESC is
> > installing the new TLS to all live threads at dlopen time
>
> Well, one could just as easily drop dynamic TLS altogether, forcing all
> TLS into the Static TLS block until it fills up, and failing for good if
> it does.  But then, you don't need TLS Descriptors at all, just go with
> Initial Exec.  It helps if init can set the Static TLS block size from
> an environment variable or somesuch.

This is not at all equivalent; it makes a situation where you have to
significantly over-provision resources or have things fail, and
there's no reasonable way of configuring it (the environment is
certainly not a correct way because it varies by application how much
should be used). On the other hand, installing dynamic TLS at dlopen
time has no visible difference in behavior to the user. It's purely an
implementation detail. So "just as easily" is not a reasonable way to
describe it.

> But your statement appears to be conflating two separate issues, namely
> the allocation of a TLS Descriptor during lazy TLS resolution, and the
> allocation of per-thread dynamic TLS blocks upon first access.

For musl I'm not thinking about either of these, since we don't do
either. All dynamic TLS memory is allocated at dlopen time. It's only
claimed just-in-time by the thread needing it. Your two cases both
potentially involve malloc, and thereby external code not under
libc/ldso control, that might clobber extra registers. Mine doesn't,
but still involves C code (because the logic to claim new dynamic TLS
is sufficiently complex that nobody wants to write it in asm for each
arch) and thereby depends on assumptions about what registers the
compiler can generate code that clobbers.

> The former is just as easy and harmless to disable as lazy function
> relocations.  The latter is not exclusive to TLSDesc: __tls_get_addr has
> historically used malloc to grow the DTV and to allocate dynamic TLS
> blocks, and if overriders to malloc end up using/clobbering unexpected
> registers, even if just because of lazy PLT resolution for calls in its
> implementation, things might go wrong.  Sure enough, __tls_get_addr
> doesn't use a specialized ABI, so this is usually not an issue.

Indeed.

> > That's actually not a bad idea -- it drops the compare/branch from the
> > dynamic tlsdesc code path, and likewise in __tls_get_addr, making both
> > forms of dynamic TLS (possibly considerably) faster.
>
> But then you have to add some form of synchronization so that other
> threads can actually mess with your DTV without conflicts, from

If a thread has not synchronized with the dlopen that added new
dynamic TLS, there's no way it can access that DTV slot, so it doesn't
matter if it reads the old version or the new version of its DTV.
However, if it does read the new version of the DTV, it needs to see
the correct entries in it. You can rely on some sort of
hardware-guaranteed consume-order property for this, but my preference
is just SYS_membarrier or equivalent after setting up the new DTV but
before installing it. Then there is no synchronization requirement on
access.

> releasing dlclose()d dynamic blocks to growing the DTV and releasing the
> old DTV while its owner thread is using it.

Likewise if a DSO is being unloaded, the thread calling dlclose must
have synchronized with any thread that could be using functions/data
from the DSO to ensure that they do not access them again, so there is
no restriction on freeing dynamic TLS blocks. There is a restriction
on freeing old DTVs, of course; if you're going to free these, I'm
pretty sure the right way to do it is have the thread itself free them
at exit. Anything else seems to impose ridiculous synchronization
costs for no measurable gain.

> I wonder if it would make sense to introduce an overridable
> call-clobbered-regs-preserving wrapper function, that lazy PLT resolvers
> and Dynamic TLSDesc calls would call, and that could be easily extended
> to preserve extended register files without having to modify the library
> proper.  LD_PRELOAD could bring it in, and it could even use ifunc
> relocations, to be able to cover all available registers on arches with
> multiple register file extensions.

The right way to do this would be to have the kernel provide it via
vdso; any other approach is going to fail to cover certain
configurations.
Assuming kernel versions >N promised to do this, ldso could just
hard-code a "pre-vdso-register-file-save" version of the function that
would cover all registers supported by kernels too old to have the
vdso function, and use it when the vdso one is missing.

Rich
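
To make the compare/branch point above concrete, here is a minimal C
sketch of the two access paths being compared: a lazy path that checks
for a stale DTV on every access, and an eager path that can skip the
check because the slot was installed for every live thread when the
module was added. Every name in it (struct thread, install_tls,
add_module, get_addr_lazy, get_addr_eager) is hypothetical and chosen
for illustration; it models neither musl's nor glibc's actual data
structures, it is single-threaded, and it only marks with a comment
where a barrier such as SYS_membarrier would have to go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified model; not the actual musl or glibc layout. */
enum { MAX_MODULES = 8, MAX_THREADS = 4 };

struct thread {
    size_t dtv_len;   /* number of valid slots in dtv */
    void **dtv;       /* dtv[m] = this thread's TLS block for module m */
};

struct tls_index { size_t module, offset; };

static size_t module_size[MAX_MODULES];
static size_t n_modules;
static struct thread *live[MAX_THREADS];
static size_t n_live;

/* Grow t's DTV to cover every loaded module and allocate missing blocks.
 * The old DTV is deliberately not freed here; the mail suggests leaving
 * that to the owning thread at exit. */
static void install_tls(struct thread *t)
{
    if (t->dtv_len == n_modules) return;      /* already up to date */
    void **nd = calloc(n_modules, sizeof *nd);
    assert(nd);
    if (t->dtv_len) memcpy(nd, t->dtv, t->dtv_len * sizeof *nd);
    for (size_t m = t->dtv_len; m < n_modules; m++) {
        nd[m] = calloc(1, module_size[m]);
        assert(nd[m]);
    }
    /* A real multi-threaded implementation would need a barrier here
     * (e.g. SYS_membarrier) before the new DTV becomes visible. */
    t->dtv = nd;
    t->dtv_len = n_modules;
}

/* Model of a thread starting: register it and give it current slots. */
static void thread_start(struct thread *t)
{
    assert(n_live < MAX_THREADS);
    t->dtv = NULL;
    t->dtv_len = 0;
    live[n_live++] = t;
    install_tls(t);
}

/* Model of dlopen adding a module with `size` bytes of TLS.
 * With eager != 0, every live thread gets its slot immediately. */
static size_t add_module(size_t size, int eager)
{
    assert(n_modules < MAX_MODULES && size > 0);
    size_t m = n_modules++;
    module_size[m] = size;
    if (eager)
        for (size_t i = 0; i < n_live; i++) install_tls(live[i]);
    return m;
}

/* Lazy scheme: every access pays a compare/branch for a stale DTV. */
static void *get_addr_lazy(struct thread *t, struct tls_index ti)
{
    if (ti.module >= t->dtv_len) install_tls(t);
    return (char *)t->dtv[ti.module] + ti.offset;
}

/* Eager scheme: the slot is known to exist, so no check is needed. */
static void *get_addr_eager(struct thread *t, struct tls_index ti)
{
    return (char *)t->dtv[ti.module] + ti.offset;
}
```

The eager accessor is only valid if add_module was called with eager
set for that module; that is the trade being discussed, with the cost
moved from every access to dlopen time.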