[
https://issues.apache.org/jira/browse/HADOOP-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526863#comment-17526863
]
Catherinot Remi commented on HADOOP-18217:
------------------------------------------
The Thread delaying the Runtime.halt call is a last-resort safety net, so I
think its thread should really live outside anything else Hadoop or any running
framework (MapReduce, YARN, Tez, Spark, etc.) does. It should not be a thread
from a pool someone else may use (and starve, preventing the halt from being
called, or interrupt, making the halt trigger too early). It can still use the
thread ShutdownHookManager registers as a system hook as its launcher, though;
that would reduce the number of extra threads needed to 1.
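A minimal illustration of the starvation risk described above (class and thread
names are hypothetical, and the "halt" is simulated by a flag so the JVM
survives the demo): a halt task queued on a shared single-thread pool never
runs while a long hook occupies the only worker, whereas a dedicated thread
fires regardless.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class HaltStarvationDemo {

    // Returns {pooledHaltRan, dedicatedHaltRan} after a short observation window.
    public static boolean[] run() throws InterruptedException {
        AtomicBoolean pooledHaltRan = new AtomicBoolean(false);
        AtomicBoolean dedicatedHaltRan = new AtomicBoolean(false);

        // A single-thread pool that "someone else" already occupies with a long hook.
        ExecutorService shared = Executors.newSingleThreadExecutor();
        shared.submit(() -> {
            try {
                Thread.sleep(5_000);   // long-running hook hogging the only worker
            } catch (InterruptedException ignored) {
            }
        });
        // A halt task queued on the same pool is starved behind it.
        shared.submit(() -> pooledHaltRan.set(true));

        // A dedicated daemon thread is not subject to that queue.
        Thread halter = new Thread(() -> dedicatedHaltRan.set(true), "DelayedHalt");
        halter.setDaemon(true);
        halter.start();
        halter.join(1_000);

        shared.shutdownNow();   // drops the still-queued halt task
        shared.awaitTermination(1, TimeUnit.SECONDS);
        return new boolean[] { pooledHaltRan.get(), dedicatedHaltRan.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("pooled halt task ran in time: " + r[0]);
        System.out.println("dedicated halt thread ran in time: " + r[1]);
    }
}
```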
About calling ExitUtil.halt: OK, but ExitUtil needs a fix too. It catches
Exception in case of OOM, but unfortunately OOMs are Errors, not Exceptions, so
there are currently scenarios where it would not call Runtime.halt when it
should have. The call, when not disabled, should be done in a finally block so
that it runs whatever happened. The same goes for the ExitUtil wrapper around
the System.exit call: it needs a try/finally when enabled.
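A rough sketch of that try/finally shape (the class and method names here are
illustrative, not the actual ExitUtil API, and the halt action is injectable so
the example does not actually kill the JVM):

```java
// Sketch of a halt wrapper robust to Errors (e.g. OutOfMemoryError), not just
// Exceptions. RobustHalt is a hypothetical stand-in for the ExitUtil wrapper.
public class RobustHalt {
    private final Runnable haltAction;   // in real code: () -> Runtime.getRuntime().halt(status)
    private final boolean disabled;

    public RobustHalt(Runnable haltAction, boolean disabled) {
        this.haltAction = haltAction;
        this.disabled = disabled;
    }

    public void halt(int status, String message) {
        try {
            // Logging or bookkeeping here may itself throw, possibly an Error...
            System.err.println("Halting with status " + status + ": " + message);
        } catch (Throwable t) {
            // ...so catch Throwable, never just Exception.
        } finally {
            // The finally block guarantees the halt happens unless disabled,
            // whatever was thrown above.
            if (!disabled) {
                haltAction.run();
            }
        }
    }
}
```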
So it would go this way:
ShutdownHookManager, being responsible for all the Hadoop-like hooks, will also
be responsible for preallocating and starting the delayed halt thread. Starting
the delayed halt will be the first thing it does when hooked, before even
logging or reading confs. After that it will continue doing what it is coded
for, the way it currently does.
ShutdownHookManager's own hook thread will have a name (it was hard to
understand what this "Thread-10" blocking my JVMs was; it had no Hadoop frames
in its stack even though it came from Hadoop). The delayed halt thread will
have a name too.
ShutdownHookManager is already shared and handles its own singleton, so (hadoop
util being in the bootstrap classloader) that avoids having it installed more
than once with different delays.
The delayed halt thread will sleep and then call ExitUtil's halt wrapper, so it
may have no effect if ExitUtil's halt is disabled.
ExitUtil's exit/halt wrappers will be patched to be robust to any Throwable, so
even Errors, not just Exceptions.
The ShutdownHookManager delayed halt thread and the HaltException will be
preallocated, including loading the configuration to parse the timeout delay
from it. Any Throwable raised during this initialization phase will result in
falling back to defaults and being logged; a Throwable won't prevent
ShutdownHookManager from installing itself as a hook.
The delayed halt thread will be a daemon so it does not block the JVM from
dying a normal death if all hooks end before the configured delay.
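The pieces above (a named daemon thread, resistant to early triggering, that
delegates to the halt wrapper after the delay) could be sketched roughly like
this; the class name, thread name and delay source are assumptions for
illustration, and the halt action is injectable so the sketch is testable:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposed delayed-halt safety net.
public class DelayedHaltThread {

    // Preallocates (but does not start) the safety-net thread. The caller
    // starts it as its very first action when the shutdown hook fires.
    public static Thread preallocate(long delayMillis, Runnable haltAction) {
        Thread t = new Thread(() -> {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
            long remainingNanos;
            while ((remainingNanos = deadline - System.nanoTime()) > 0) {
                try {
                    TimeUnit.NANOSECONDS.sleep(remainingNanos);
                } catch (InterruptedException ignored) {
                    // Swallow interrupts: an interrupt must not fire the halt early.
                }
            }
            haltAction.run();   // in the real fix: the ExitUtil halt wrapper
        }, "Hadoop-DelayedHalt");   // named, so it is identifiable in thread dumps
        t.setDaemon(true);          // daemon: does not keep a healthy JVM alive
        return t;
    }
}
```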
Question: what halt code should be used? "Configurable with a default" or
"hardcoded but documented"?
> shutdownhookmanager should not be multithreaded (deadlock possible)
> -------------------------------------------------------------------
>
> Key: HADOOP-18217
> URL: https://issues.apache.org/jira/browse/HADOOP-18217
> Project: Hadoop Common
> Issue Type: Bug
> Components: util
> Affects Versions: 2.10.1
> Environment: linux, windows, any version
> Reporter: Catherinot Remi
> Priority: Minor
> Attachments: wtf.java
>
>
> the ShutdownHookManager class uses an executor to run hooks so as to have a
> "timeout" notion around them. It does this using a single-threaded executor.
> This can lead to a deadlock, leaving a never-shutting-down JVM, with this
> execution flow:
> * JVM need to exit (only daemon threads remaining or someone called
> System.exit)
> * ShutdownHookManager kicks in
> * the SHMngr executor starts running some hooks
> * the SHMngr executor thread kicks in and, as a side effect, runs some code
> from one of the hooks that calls System.exit (as a side effect of an external
> lib, for example)
> * the executor thread waits for a lock, because another thread already entered
> System.exit and holds its internal lock, so the executor never returns
> * SHMngr never returns
> * 1st call to System.exit never returns
> * JVM stuck
>
> using an executor with a single thread gives "fake" timeouts (the task keeps
> running; you can interrupt it, but until it stumbles upon some piece of code
> that is interruptible (like an I/O call) it will keep running), especially
> since the executor is single-threaded. So it has this bug, for example:
> * the caller submits the 1st hook (a bad one that needs 1 hour of runtime and
> cannot be interrupted)
> * the executor starts the 1st hook
> * the caller times out waiting on the 1st hook's future
> * the caller submits the 2nd hook
> * bug: the 1st hook is still running; the 2nd hook triggers a timeout but
> never got the chance to run anyway. So the 1st faulty hook makes it impossible
> for any other hook to get a chance to run: running hooks in a single separate
> thread does not allow other hooks to run in parallel with long ones.
>
> If we really want to time out the JVM shutdown, even accepting a possibly
> dirty shutdown, it should rather handle the hooks inside the initial thread
> (not spawning new one(s), so not triggering the deadlock described in the 1st
> place) and, if a timeout was configured, only spawn a single parallel daemon
> thread that sleeps for the timeout delay and then uses Runtime.halt (which
> bypasses the hook system, so should not trigger the deadlock). If the normal
> System.exit ends before the timeout delay, everything is fine. If the
> System.exit took too much time, the JVM is killed, and so the reason why this
> multithreaded shutdown hook implementation was created is satisfied (avoiding
> hanging JVMs).
>
> Had the bug with both Oracle and OpenJDK builds, all on major version 1.8.
> Hadoop 2.6 and 2.7 did not have the issue because they do not run hooks in
> another thread.
>
> Another solution is of course to configure the timeout AND to have as many
> threads as needed to run the hooks, so as to have at least some gain to
> offset the pain of the deadlock scenario.
>
> EDIT: added some logs and reproduced the problem. In fact it is located after
> triggering all the hook entries and before shutting down the executor. The
> current code, after running the hooks, creates a new Configuration object,
> reads the configured timeout from it, and applies this timeout to shut down
> the executor. I sometimes run with a classloader doing remote classloading;
> Configuration loads its content using this classloader, so when the JVM is
> shutting down and some network error occurs, the classloader fails to load
> the resources Configuration needs. So the code crashes before shutting down
> the executor and ends up inside the thread's default uncaught-throwable
> handler, which was calling System.exit, so it got stuck, so shutting down the
> executor never returned, and neither did the JVM.
> So, forget about the halt stuff (even if it is a last-resort, very robust
> safety net). Still, I'll make a small adjustment to the final executor
> shutdown code to be slightly more robust to even the strangest
> exceptions/errors it encounters.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]