[jira] [Work logged] (HADOOP-18217) shutdownhookmanager should not be multithreaded (deadlock possible)

ASF GitHub Bot (Jira) Mon, 02 May 2022 01:40:07 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-18217?focusedWorklogId=764877&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-764877
 ]


ASF GitHub Bot logged work on HADOOP-18217:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/May/22 08:39
            Start Date: 02/May/22 08:39
    Worklog Time Spent: 10m 
      Work Description: HerCath opened a new pull request, #4255:
URL: https://github.com/apache/hadoop/pull/4255

   …king halt + enlarged catches (robustness) to all Throwables (not just 
Exceptions)
   
   <!--
     Thanks for sending a pull request!
       1. If this is your first time, please read our contributor guidelines: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
       2. Make sure your PR title starts with JIRA issue id, e.g., 
'HADOOP-17799. Your PR title ...'.
   -->
   
   ### Description of PR
   I've reduced the synchronized blocks scope so System.exit and Runtime.halt 
calls aren't within their boundaries, so ExitUtil wrappers do not block each 
others (System.exit never returns if called and no SecurityException is raised, 
so ExitUtil.terminate was never releasing the acquired lock, thus forbidding 
ExitUtil.halt to be able to halt the JVM even when in the middle of a graceful 
shutdown, thus it was not behaving like the 2 wrapped java's methods 
System.exit/Runtime.halt which do not block each other)
   I've altered throwable handling:
     - what is catched: was nothing or only Exception, now all Throwables are 
catched (even ThreadDeath)
     - what is rethrown: when exit/halt has been disabled, if what was catched 
is an Error it will be rethrown rather than the initial 
ExitException/HaltException. Other Throwables will be added as suppressed to 
the Exit/HaltException
     - what wasn't catched: if not disabled, even is something was raised that 
wasn't catched before, it is now catched and System.exit/Runtime.halt is always 
called
     - what is suppressed: if the what needs to be rethrown is changed on the 
way, the newly to-be-thrown will have the old one as a suppressed Throwable. 
I've also done this for the Exit/Halt Exception that can supress Throwables 
that are not Error (might not be a so good idea)
   
   ### How was this patch tested?
   No more tests than the existing ones (if any). This case is not really hard 
to reproduce but the test would need to exit a JVM. I've not added such tests 
because if unit does not fork, it would kills the test suite (thus impacting 
all tests). I think developing a robust test for this specific case is way more 
hard and  dangerous to offset the cost of a review, the risk of what could be 
missed by this review.
   
   Easiest way can be reproduced the initial bug: having a shutdown hook call 
ExitUtil.terminate, have another thread that will call ExitUtil.halt after (use 
pauses to ensure it calls it after the hook), witness the JVM not stopping and 
needing either an external kill or a internal Runtime.halt call, maybe check 
the JVM threads' stacks too to view the ExitUtil.terminate call stuck on 
System.exit, and ExitUtil.halt call stuck on ExitUtil.terminate.
   
   ### For code changes:
   
   - [x] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




Issue Time Tracking
-------------------

            Worklog Id:     (was: 764877)
    Remaining Estimate: 0h
            Time Spent: 10m

> shutdownhookmanager should not be multithreaded (deadlock possible)
> -------------------------------------------------------------------
>
>                 Key: HADOOP-18217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 2.10.1
>         Environment: linux, windows, any version
>            Reporter: Catherinot Remi
>            Priority: Minor
>         Attachments: wtf.java
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> the ShutdownHookManager class uses an executor to run hooks to have a 
> "timeout" notion around them. It does this using a single threaded executor. 
> It can leads to deadlock leaving a never-shutting-down JVM with this 
> execution flow:
>  * JVM need to exit (only daemon threads remaining or someone called 
> System.exit)
>  * ShutdowHookManager kicks in
>  * SHMngr executor start running some hooks
>  * SHMngr executor thread kicks in and, as a side effect, run some code from 
> one of the hook that calls System.exit (as a side effect from an external lib 
> for example)
>  * the executor thread is waiting for a lock because another thread already 
> entered System.exit and has its internal lock, so the executor never returns.
>  * SHMngr never returns
>  * 1st call to System.exit never returns
>  * JVM stuck
>  
> using an executor with a single thread does "fake" timeouts (the task keeps 
> running, you can interrupt it but until it stumble upon some piece of code 
> that is interruptible (like an IO) it will keep running) especially since the 
> executor is a single threaded one. So it has this bug for example :
>  * caller submit 1st hook (bad one that would need 1 hour of runtime and that 
> cannot be interrupted)
>  * executor start 1st hook
>  * caller of the future 1st hook result timeout
>  * caller submit 2nd hook
>  * bug : 1 hook still running, 2nd hook triggers a timeout but never got the 
> chance to run anyway, so 1st faulty hook makes it impossible for any other 
> hook to have a chance to run, so running hooks in a single separate thread 
> does not allow to run other hooks in parallel to long ones.
>  
> If we really really want to timeout the JVM shutdown, even accepting maybe 
> dirty shutdown, it should rather handle the hooks inside the initial thread 
> (not spawning new one(s) so not triggering the deadlock described on the 1st 
> place) and if a timeout was configured, only spawn a single parallel daemon 
> thread that sleeps the timeout delay, and then use Runtime.halt (which bypass 
> the hook system so should not trigger the deadlock). If the normal 
> System.exit ends before the timeout delay everything is fine. If the 
> System.exit took to much time, the JVM is killed and so the reason why this 
> multithreaded shutdown hook implementation was created is satisfied (avoding 
> having hanging JVMs)
>  
> Had the bug with both oracle and open jdk builds, all in 1.8 major version. 
> hadoop 2.6 and 2.7 did not have the issue because they do not run hooks in 
> another thread
>  
> Another solution is of course to configure the timeout AND to have as many 
> threads as needed to run the hooks so to have at least some gain to offset 
> the pain of the dealock scenario
>  
> EDIT: added some logs and reproduced the problem. in fact it is located after 
> triggering all the hook entries and before shutting down the executor. 
> Current code, after running the hooks, creates a new Configuration object and 
> reads the configured timeout from it, applies this timeout to shutdown the 
> executor. I sometimes run with a classloader doing remote classloading, 
> Configuration loads its content using this classloader, so when shutting down 
> the JVM and some network error occurs the classloader fails to load the 
> ressources needed by Configuration. So the code crash before shutting down 
> the executor and ends up inside the thread's default uncaught throwable 
> handler, which was calling System.exit, so got stuck, so shutting down the 
> executor never returned, so does the JVM.
> So, forget about the halt stuff (even if it is a last ressort very robust 
> safety net). Still I'll do a small adjustement to the final executor shutdown 
> code to be slightly more robust to even the strangest exceptions/errors it 
> encounters.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Work logged] (HADOOP-18217) shutdownhookmanager should not be multithreaded (deadlock possible)

Reply via email to