Sorry the formatting of the replies is getting totally screwed up now :(
On 10/11/2015 8:20 PM, Thomas Stüfe wrote:
Hi David,
On Fri, Nov 6, 2015 at 11:20 PM, David Holmes <[email protected]> wrote:
On 6/11/2015 9:52 PM, Thomas Stüfe wrote:
Hi David,
On Fri, Nov 6, 2015 at 7:26 AM, David Holmes <[email protected]> wrote:
Hi Jeremy,
Okay I have read:
https://sourceware.org/glibc/wiki/TLSandSignals
and the tree of mail postings referenced therefrom - great reading! :)
So the basic problem: access to __thread variables is not async-signal-safe.
An exacerbating factor: the first attempt to even read a __thread variable can lead to allocation, which is the real problem in relation to async-signal-safety.
I mention the exacerbating factor because pthread_getspecific and pthread_setspecific are also not async-signal-safe, but we already use them. However, pthread_getspecific is in fact (per the email threads linked above) effectively async-signal-safe, and further, a call to pthread_getspecific never results in a call to pthread_setspecific or an allocation. Hence the pthread functions are almost, if not completely, safe in practice with reasonable uses (i.e. only reading from a signal handler). Which explains this code in the existing Thread::current():
#ifdef PARANOID
  // Signal handler should call ThreadLocalStorage::get_thread_slow()
  Thread* t = ThreadLocalStorage::get_thread_slow();
  assert(t != NULL && !t->is_inside_signal_handler(),
         "Don't use Thread::current() inside signal handler");
#endif
So the problem scenario is: use of a __thread variable (one that belongs to the shared library) in a signal handler.
Solution 0: don't do that. Seriously - like any other async-signal-unsafe stuff, we should not be using it in real signal handlers. The crash handler is a different matter - we try all sorts there, because it might work and you can't die twice.
Otherwise: narrow the window of exposure.
1. We ensure we initialize thread_current (even with a dummy value) as early as possible in the thread that loads libjvm. As we have no signal handlers installed at that point that might use the same variable, we cannot hit the problem scenario.
2. We ensure we initialize thread_current in a new thread with all signals blocked. This again avoids the problem scenario.
3. We initialize thread_current in an attaching thread as soon as possible, and we again first block all signals. (A sketch of this blocked-signals initialization pattern follows below.)
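To make steps 2 and 3 concrete, here is a minimal sketch, assuming a hypothetical initialize_thread_current() helper - the actual entry point and signature in the patch may differ:

  #include <signal.h>
  #include <pthread.h>

  class Thread;
  static __thread Thread* _thr_current = NULL;

  // Hypothetical helper: perform the first touch of the TLS variable
  // with all signals blocked, so no signal handler can run while
  // glibc does any lazy allocation for this DSO's TLS block.
  void initialize_thread_current(Thread* thread) {
    sigset_t all, saved;
    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, &saved);    // block everything
    _thr_current = thread;                       // first access; may allocate
    pthread_sigmask(SIG_SETMASK, &saved, NULL);  // restore original mask
  }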
That still leaves the problem of an unattached native thread taking a signal whilst in async-signal-unsafe code, and executing a signal handler which in turn tries to access thread_current for the first time. This signal handler need not be an actual JVM handler, but could be one attached by other native code, e.g. an agent. I'm not clear, in the latter case, how reasonable it is for an agent's handler to try to do things from an unattached thread - and we don't claim any JNI interfaces can, or should, be called from a signal handler - but it is something you can probably get away with today.
Let me also point out that we already effectively have this code on Solaris (at the ThreadLocalStorage class level). So if there is something here that will prevent the current proposal, we already have a problem on Solaris. :(
Thoughts/comments/suggestions?
The first problem: a thread initializes the TLS variable, gets interrupted and accesses the half-initialized variable from within the signal handler. This could happen today too, couldn't it? But I think we have never seen this.
That depends on the state of the signal masks at the time of the initialization. For threads created in the VM, and for threads attached to the VM, it is likely not an issue. Unattached threads could in theory try to access a TLS variable from a signal handler, but they will never be initializing that variable. Of course, the unattached thread could be initializing a completely distinct TLS variable, but reading a different TLS variable from the signal handler does not seem to be an issue (in theory it may be, but this is an extreme corner case).
In theory, it could be mitigated by some careful testing before using the Thread::current() value in the signal handler. For example, put an eyecatcher at the beginning of the Thread structure and check it using SafeFetch.
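Something like this sketch - the eyecatcher field and its value are hypothetical; SafeFetch32(int*, int) returns the supplied error value instead of faulting if the address is unmapped:

  // Hypothetical check before trusting a Thread* inside a signal
  // handler: assumes Thread starts with an eyecatcher word.
  const int THREAD_EYECATCHER = 0x54485244;  // "THRD" - illustrative value

  bool looks_like_thread(Thread* t) {
    if (t == NULL) return false;
    int v = SafeFetch32((int*) t, /*errValue=*/ -1);
    return v == THREAD_EYECATCHER;
  }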
There is no way to access the Thread structure before calling Thread::current(). And the potential problem is with unattached threads, which have no Thread structure. For threads attached to the VM, or attaching, my three steps will deal with any potential problems.
As for the second problem - recursive malloc() deadlocks - I am at a loss. I do not fully understand, though, why pthread_getspecific is different - does it not have to allocate space for the TLS variable too?
No, pthread_getspecific does not have to allocate. Presumably it is written in a way that attempting to index a TLS variable that has not been allocated just returns an error (EINVAL?).
It would return NULL.
The problem with __thread is that even a read will attempt to do the allocation - arguably (as the Google folk did argue) this is wrong, or else should be done in an async-safe way.
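For contrast, the pthread behaviour is trivially demonstrable with a standalone POSIX example (nothing HotSpot-specific here):

  #include <pthread.h>
  #include <stdio.h>

  int main(void) {
    pthread_key_t key;
    pthread_key_create(&key, NULL);
    // A key that has never been set in this thread reads as NULL -
    // the read path performs no allocation.
    void* v = pthread_getspecific(key);
    printf("unset key reads as %p\n", v);  // prints (nil) on glibc
    return 0;
  }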
I looked up the implementation of pthread_getspecific and pthread_setspecific in glibc and now understand better why pthread TLS is considered safe here.
glibc allows for 1024 TLS slots, which are organized as a 32x32 sparse array whose second-level arrays are only allocated when the first slot in them is used by pthread_setspecific - so only when writing a slot. It also means that the number of allocation calls is smaller than with __thread: instead of (presumably) calling malloc() for every instance of __thread, it calls malloc() at most 32 times. And the first 32 slots are already allocated in the pthread structure, so they are free. This means that even if one were to write to a TLS slot in the signal handler, the chances of it mallocing are quite small.
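A simplified model of that two-level lookup (modelled on glibc's layout, but the names and structure here are illustrative, not glibc's actual identifiers):

  #define KEY_1STLEVEL_SIZE 32
  #define KEY_2NDLEVEL_SIZE 32   // 32 x 32 = 1024 keys in total

  struct key_data { void* value; unsigned seq; };

  struct thread_descriptor {
    // Block for the first 32 keys, embedded - no malloc needed.
    // specific[0] is pointed at this block when the thread is created.
    struct key_data first_block[KEY_2NDLEVEL_SIZE];
    // Second-level blocks for the remaining keys, allocated lazily by
    // setspecific: at most 31 further mallocs per thread.
    struct key_data* specific[KEY_1STLEVEL_SIZE];
  };

  void* getspecific_model(struct thread_descriptor* self, unsigned key) {
    struct key_data* block = self->specific[key / KEY_2NDLEVEL_SIZE];
    if (block == NULL)
      return NULL;   // never written: the read allocates nothing
    return block[key % KEY_2NDLEVEL_SIZE].value;
  }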
__thread will only need to malloc (in the context we are discussing) when a pre-existing thread first references a TLS variable from a lazily loaded DSO. Otherwise the space for the DSO's TLS variables is allocated "statically" when the thread control structures are created.
This does leave me wondering exactly what effect the:

  static __thread Thread* _thr_current = NULL;

has in terms of any per-thread allocation. ??
Anyway, to reiterate the problem scenario (a standalone sketch follows the list):
- the VM has been loaded in a process and signal handlers have been installed (maybe by the VM, maybe by an agent)
- an unattached thread is doing a malloc() when it takes a signal
- the signal handler tries to read a __thread variable and we get a malloc deadlock
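Something like this sketch captures the hazard. It is illustrative only: it assumes this code is compiled into a dlopen'ed shared library, so that the first read of tls_var takes the dynamic TLS path (in a main executable the TLS block is static and no allocation occurs), and whether it actually deadlocks depends on the glibc version and on the signal landing inside malloc:

  #include <pthread.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <unistd.h>

  static __thread void* tls_var;   // dynamic TLS when in a dlopen'ed DSO

  static void handler(int sig) {
    // First access from this thread: glibc may malloc() the DSO's TLS
    // block right here. If the signal interrupted malloc(), we can
    // self-deadlock on the allocator's lock.
    void* v = tls_var;
    (void) v;
  }

  static void* worker(void* arg) {
    for (;;) free(malloc(64));     // spends most of its time in malloc
    return NULL;
  }

  int main(void) {
    signal(SIGPROF, handler);
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    for (;;) { usleep(1000); pthread_kill(t, SIGPROF); }
  }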
As I said, I need to determine what signal handlers in the VM might ever run on an unattached thread, and what they might do.
I don't understand - our signal handler is globally active, no? So any unattached thread may execute our signal handler at any time, and the first thing our signal handler does is call Thread::current(). If there were a third-party signal handler, it would get called as a chained handler, but only after our signal handler ran.
The current code uses ThreadLocalStorage::get_thread_slow() in the signal handler, which uses pthread_getspecific, which is "safe". The new code would access the __thread variable and have to malloc - which is unsafe.
Using the JDK launchers, there are no unattached threads created before libjvm is loaded - so the problem would never arise.
I have been looking hard at our signal handlers, though, and found they don't seem to match how they are described in various parts of the code that set the signal masks. My main concern is with process-directed signals (SIGQUIT, SIGTERM) that trigger specific actions (thread dump, orderly shutdown). Synchronous signals are not an issue - if the unattached thread triggers a SEGV while in malloc then the process is doomed anyway, and a deadlock is less problematic (not great, but hardly in the same league as deadlocking an active, fully working VM). But I'm still having trouble joining all the dots in this code and figuring out how an unattached thread might react today. I'll continue untangling this tomorrow.
Thanks, Thomas
(My current feeling is that I'd prefer to keep the pthread TLS solution, but I like your simplifications to the code and would like to keep those too...)
It was all the complex, inconsistent caching mechanisms employed over the top of the library-based TLS that motivated the cleanup - especially as the cache on Solaris was shown to be broken. If it were just a simple library-based TLS layer, there would be less motivation to try the __thread approach - but __thread had the appeal of removing a lot of duplicated code. A simple library-based scheme might be an alternative if it is performant enough - but I'm not sure I have the time to go back to square one on this.
Thanks,
David
For a "third-party" signal handler there's really nothing I can do -
they should not be accessing the VM's __thread variables though (and
they cal always introduce their own independent deadlocks by
performing non-async-safe actions).
Thanks,
David
Regards, Thomas
Thanks,
David
On 6/11/2015 1:09 PM, David Holmes wrote:
Hi Jeremy,
I was going to ask you to elaborate :)
On 6/11/2015 12:24 PM, Jeremy Manson wrote:
I should probably elaborate on this. With glibc + ELF, the first time a thread accesses a variable declared __thread, if that variable is in a shared library (as opposed to the main executable), the system calls malloc() to allocate the space for it. If that happens in a signal that is being delivered during a call to malloc(), then you usually get a crash.
My understanding of the ELF ABI for thread-locals - which I read about in the Solaris 11.1 Linkers and Libraries guide - does require use of the dynamic TLS model for any dynamically loaded shared object which defines a thread-local, but that is what we use as I understand it. The docs state:
"A shared object containing only dynamic TLS can be
loaded following
process startup without limitations. The runtime linker
extends
the list
of initialization records to include the initialization
template
of the
new object. The new object is given an index of m = M +
1. The
counter M is incremented by 1. However, the allocation
of new
TLS blocks
is deferred until the blocks are actually referenced."
Now I guess "extends the list" might be implemented using malloc ... but this will only occur in the main thread (the one started by the launcher to load the JVM and become the main thread), at the time libjvm is loaded - which will all be over before any agent etc. can run and do anything. But "allocation ... is deferred" suggests we may have a problem until either the first call to Thread::current or the call to Thread::initialize_thread_current. If it is the former, then that should occur well before any agent etc. can be loaded. And I can easily inject an initial dummy call to initialize_thread_current(null) to force the TLS allocation (sketched below).
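A hedged sketch of what that dummy call might look like - the constructor-attribute hook is just one way to illustrate "as early as possible", not necessarily what the patch would do, and the initialize_thread_current(NULL) signature is assumed:

  // Runs when libjvm is loaded, before any agent code or signal
  // handlers can be installed: the dummy first touch forces glibc to
  // allocate this DSO's TLS block for the loading thread.
  __attribute__((constructor))
  static void force_tls_allocation() {
    Thread::initialize_thread_current(NULL);
  }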
This may bite you if AsyncGetCallTrace uses Thread::current() and you use system timers to do profiling. If a thread is doing a malloc() prior to the first time it accesses Thread::current(), and it gets delivered a signal, it might die. This is especially likely for pure native threads started by native code.
I believe that this is a use case you support, so you might want to make sure it is okay.
For a VM embedded in a process which already contains native threads that will later attach to the VM, this may indeed be a problem. One would have hoped, however, that the implementation of TLS would be completely robust, at least for something as simple as getting a signal whilst in the allocator.
I'm unclear how to test for or check for this kind of problem. Arguably there could be many things that are async-unsafe in this way.
Need to think more about this and do some more research. Would also appreciate any insight from any glibc and/or ELF gurus.
Thanks.
David
Jeremy
On Thu, Nov 5, 2015 at 5:58 PM, Jeremy Manson <[email protected]> wrote:
Something that's bitten me with __thread: it isn't async-safe when called from a shared object on Linux. Have you vetted this to make sure it doesn't make HS less async-safe?
Jeremy
On Sun, Nov 1, 2015 at 10:40 PM, David Holmes <[email protected]> wrote:
bug: https://bugs.openjdk.java.net/browse/JDK-8132510
Open webrev: http://cr.openjdk.java.net/~dholmes/8132510/webrev.v2/
A simple (in principle) but wide-ranging change which should appeal to our Code Deletion Engineers. We implement Thread::current() using a compiler/language-based thread-local variable, e.g.:

  static __thread Thread* _thr_current;

  inline Thread* Thread::current() {
    return _thr_current;
  }
with an appropriate setter of course (a sketch follows below). By doing this we can completely remove the platform-specific ThreadLocalStorage implementations, and the associated os::thread_local_storage* calls, plus all the uses of ThreadLocalStorage::thread() and ThreadLocalStorage::get_thread_slow(). This extends the previous work done on Solaris to implement ThreadLocalStorage::thread() using compiler-based thread-locals.
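For concreteness, such a setter might look like the following - a sketch only, as the actual name and any extra bookkeeping in the webrev may differ:

  inline void Thread::initialize_thread_current(Thread* thread) {
    _thr_current = thread;   // store into the compiler-based TLS slot
  }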
We can also consolidate nearly all the os_cpu versions of MacroAssembler::get_thread on x86 into one CPU-specific one (a special variant is still needed for 32-bit Windows).
As a result of this change we have further potential cleanups:
- all the src/os/<os>/vm/thread_<os>.inline.hpp files are now completely empty and could also be removed
- the MINIMIZE_RAM_USAGE define (which avoids use of the Linux sp-map "cache" on 32-bit) now has no effect and so could be completely removed from the build system
I plan to do the MINIMIZE_RAM_USAGE removal as a follow-up CR, but could add the removal of the "inline" files to this CR if people think it worth removing them.
I have one missing piece on AArch64 - I need to change MacroAssembler::get_thread to simply call Thread::current(), as on other platforms, but I don't know how to write that. I would appreciate it if someone could give me the right code for that.
I would also appreciate comments/testing by the AIX and PPC64 folk as well.
A concern about memory leaks had previously been raised, but experiments using simple C code on Linux x86 and Solaris showed no issues. Also note that AArch64 already uses this kind of thread-local.
Thanks,
David