[ 
https://issues.apache.org/jira/browse/HADOOP-18217?focusedWorklogId=765676&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-765676
 ]

ASF GitHub Bot logged work on HADOOP-18217:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/May/22 19:45
            Start Date: 03/May/22 19:45
    Worklog Time Spent: 10m 
      Work Description: HerCath commented on PR #4255:
URL: https://github.com/apache/hadoop/pull/4255#issuecomment-1116496071

   Hi,
   
   Thanks for the reviews. I've started to adjust my code to integrate them.
   
   OK, I'll adjust the comment style and rename the variable to caught. I'm 
also adjusting the javadoc to make it clear that in enabled mode all 
throwables are caught, even Errors (except perhaps a ThreadDeath happening 
after the last try/catch). In disabled mode, the argument exception is 
re-thrown with the other internally thrown throwables attached as suppressed, 
unless at least one of them is an Error, in which case the first Error is 
re-thrown and everything else is suppressed. As before the patch, I don't 
document the RuntimeException subclasses or Errors that System.exit or 
Runtime.halt could throw.
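   The disabled-mode rethrow policy described above can be sketched roughly as 
follows (a minimal illustration of the intent, not the actual ExitUtil code; 
the helper name is hypothetical):

```java
import java.util.List;

public class RethrowPolicy {
    // Hypothetical helper illustrating the disabled-mode policy: re-throw
    // the original exception with internally thrown throwables attached as
    // suppressed, unless one of them is an Error, in which case the first
    // Error wins and everything else (including the original) is suppressed.
    static Throwable selectToThrow(Exception original, List<Throwable> internal) {
        Throwable primary = original;
        for (Throwable t : internal) {
            if (t instanceof Error) {      // first Error takes precedence
                primary = t;
                break;
            }
        }
        for (Throwable t : internal) {
            if (t != primary) {
                primary.addSuppressed(t);  // keep the others for diagnostics
            }
        }
        if (primary != original) {
            primary.addSuppressed(original);
        }
        return primary;
    }
}
```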
   
   About atomic versus volatile: the "disable" flags (for halt and exit) are 
not used in a test-and-set pattern, nor in any multi-step action that would 
have to be performed atomically, so volatile provides enough cross-thread 
visibility for them. The "first(Exit|Halt)Exception" fields would benefit 
from it though, so I'm changing them to AtomicReference. If the "disable" 
flags were more tightly coupled to the "first" fields, a synchronized block 
would be needed to keep the ("flag", "first") tuple consistent, but that's 
not the case here either: each can be reset regardless of the other's value. 
Moving to Atomic has another benefit: it allows removing all synchronized 
blocks from the ExitUtil class, so even if badly behaved or malicious code 
enters a synchronized block on ExitUtil.class and never leaves it, ExitUtil's 
terminate/halt code can still run without blocking.
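   The "first exception" capture can be sketched with 
AtomicReference.compareAndSet (a minimal sketch of the idea, with hypothetical 
names, not the actual patch):

```java
import java.util.concurrent.atomic.AtomicReference;

public class FirstExceptionHolder {
    // compareAndSet(null, t) records only the first throwable seen, with no
    // synchronized block, so this path can never block on a stuck monitor.
    private final AtomicReference<Throwable> first = new AtomicReference<>();

    boolean recordIfFirst(Throwable t) {
        // succeeds only for the very first caller; later calls are no-ops
        return first.compareAndSet(null, t);
    }

    Throwable get() {
        return first.get();
    }

    void reset() {
        // resetting is independent of any "disable" flag, as argued above
        first.set(null);
    }
}
```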
   
   About tests: since I'm changing to AtomicReference I will add some (I 
still have to check whether some already exist), but only for disabled mode.
   
   For enabled mode, I'm a bit afraid of the impact if I miss something in 
the tests. A real test would require spawning an external JVM with a valid 
classpath, having it call ExitUtil with expected codes, and checking its 
return code when it dies, or killing it after a timeout (the expected codes 
need to differ from those produced by a kill on timeout). If for some obscure 
reason the test really goes wrong and does not kill the external JVM, I don't 
want a hanging JVM to impact other tests. If the server/platform is supposed 
to have a long uptime, I don't want to leak spawned processes and make the 
testing environment unstable/unreliable after several test runs. I don't know 
whether such tests would need to be cross-platform either. I only have access 
to Windows and Linux, not macOS, so I won't be able to check there that I 
really kill the spawned JVM (I would use java.lang.Process, which should be 
portable, even for the kill). I also don't know the impact on other 
developers running the tests on their own machines.
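   The spawn-and-kill part of such a test could look roughly like this (a 
hypothetical sketch using only java.lang.Process, assuming a java binary is 
reachable on the PATH; names and the killedCode convention are mine):

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ExternalJvm {
    // Run a command in a child process, wait up to timeoutMillis, and kill
    // it forcibly on timeout. Returns the child's exit code, or killedCode
    // (which must differ from every expected exit code) when it was killed.
    static int runWithTimeout(List<String> command, long timeoutMillis,
                              int killedCode) throws Exception {
        Process p = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .start();
        if (p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return p.exitValue();          // child finished in time
        }
        p.destroyForcibly();               // portable kill via Process
        p.waitFor();                       // reap the killed child
        return killedCode;
    }
}
```

A caller could then use, for example, runWithTimeout(List.of("java", "-cp", 
classpath, mainClass), 30_000, -1) and compare the result against the exit 
codes the spawned main is expected to pass to ExitUtil.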




Issue Time Tracking
-------------------

    Worklog Id:     (was: 765676)
    Time Spent: 50m  (was: 40m)

> ShutdownHookManager should not be multithreaded (deadlock possible)
> -------------------------------------------------------------------
>
>                 Key: HADOOP-18217
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18217
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 2.10.1
>         Environment: linux, windows, any version
>            Reporter: Catherinot Remi
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: wtf.java
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> the ShutdownHookManager class uses an executor to run hooks so it can have a 
> "timeout" notion around them. It does this using a single-threaded executor. 
> This can lead to a deadlock, leaving a never-shutting-down JVM, with this 
> execution flow:
>  * the JVM needs to exit (only daemon threads remain, or someone called 
> System.exit)
>  * ShutdownHookManager kicks in
>  * the SHMngr executor starts running some hooks
>  * the SHMngr executor thread kicks in and, as a side effect, runs some code 
> from one of the hooks that calls System.exit (for example as a side effect 
> of an external lib)
>  * the executor thread waits for a lock because another thread already 
> entered System.exit and holds its internal lock, so the executor never 
> returns
>  * SHMngr never returns
>  * the 1st call to System.exit never returns
>  * the JVM is stuck
>  
> using a single-threaded executor gives only "fake" timeouts (the task keeps 
> running; you can interrupt it, but until it stumbles upon some interruptible 
> piece of code (like an IO) it keeps running). So it has this bug, for 
> example:
>  * caller submits a 1st hook (a bad one that needs 1 hour of runtime and 
> cannot be interrupted)
>  * executor starts the 1st hook
>  * caller of the 1st hook's future times out
>  * caller submits a 2nd hook
>  * bug: the 1st hook is still running; the 2nd hook triggers a timeout but 
> never got the chance to run anyway. So the 1st faulty hook makes it 
> impossible for any other hook to run: running hooks in a single separate 
> thread does not allow other hooks to run in parallel with long ones.
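This "fake timeout" behaviour is easy to demonstrate (a self-contained sketch 
with shortened delays; the busy-wait stands in for an uninterruptible hook):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FakeTimeoutDemo {
    // Future.get with a timeout does not stop the task: the first
    // (interrupt-ignoring) task keeps the single worker busy, so the
    // second task never even starts before its own "timeout".
    static boolean secondHookStarted() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> first = executor.submit(() -> {
                long end = System.currentTimeMillis() + 1000;
                // busy-wait that ignores interruption, like an
                // uninterruptible shutdown hook
                while (System.currentTimeMillis() < end) { }
            });
            try {
                first.get(100, TimeUnit.MILLISECONDS);   // caller "times out"
            } catch (TimeoutException expected) {
            }
            final boolean[] started = { false };
            Future<?> second = executor.submit(() -> { started[0] = true; });
            try {
                second.get(100, TimeUnit.MILLISECONDS);  // also "times out"
            } catch (TimeoutException expected) {
            }
            return started[0];   // false: the 1st hook still hogs the thread
        } finally {
            executor.shutdownNow();
        }
    }
}
```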
>  
> If we really want to time out the JVM shutdown, even accepting a possibly 
> dirty shutdown, it should rather run the hooks inside the initial thread 
> (not spawning new one(s), so not triggering the deadlock described in the 
> 1st place) and, if a timeout was configured, only spawn a single parallel 
> daemon thread that sleeps for the timeout delay and then calls Runtime.halt 
> (which bypasses the hook system, so should not trigger the deadlock). If 
> the normal System.exit ends before the timeout delay, everything is fine. 
> If System.exit took too much time, the JVM is killed, and the reason why 
> this multithreaded shutdown hook implementation was created is satisfied 
> (avoiding hanging JVMs)
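The proposed watchdog could be sketched as follows (a minimal illustration of 
the design described above, with hypothetical names, not a patch):

```java
public class HaltWatchdog {
    // A single daemon thread sleeps for the timeout delay and then calls
    // Runtime.halt, which bypasses the hook system. If the normal
    // System.exit finishes first, the daemon thread dies with the JVM and
    // halt never runs; interrupting the thread also cancels it.
    static Thread start(long timeoutMillis, int haltCode) {
        Thread watchdog = new Thread(() -> {
            try {
                Thread.sleep(timeoutMillis);
            } catch (InterruptedException e) {
                return;                      // cancelled: shutdown finished
            }
            Runtime.getRuntime().halt(haltCode);
        }, "shutdown-watchdog");
        watchdog.setDaemon(true);            // must not keep the JVM alive
        watchdog.start();
        return watchdog;
    }
}
```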
>  
> Had the bug with both Oracle and OpenJDK builds, all in the 1.8 major 
> version. Hadoop 2.6 and 2.7 did not have the issue because they do not run 
> hooks in another thread
>  
> Another solution is of course to configure the timeout AND to have as many 
> threads as needed to run the hooks, so there is at least some gain to 
> offset the pain of the deadlock scenario
>  
> EDIT: added some logs and reproduced the problem. In fact it is located 
> after triggering all the hook entries and before shutting down the 
> executor. The current code, after running the hooks, creates a new 
> Configuration object, reads the configured timeout from it, and applies 
> this timeout to shut down the executor. I sometimes run with a classloader 
> doing remote classloading; Configuration loads its content using this 
> classloader, so when the JVM is shutting down and some network error 
> occurs, the classloader fails to load the resources needed by 
> Configuration. So the code crashes before shutting down the executor and 
> ends up inside the thread's default uncaught-throwable handler, which was 
> calling System.exit, so it got stuck, shutting down the executor never 
> returned, and neither did the JVM.
> So, forget about the halt stuff (even if it is a last-resort, very robust 
> safety net). Still, I'll make a small adjustment to the final executor 
> shutdown code to be slightly more robust against even the strangest 
> exceptions/errors it encounters.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
