[ https://issues.apache.org/jira/browse/HADOOP-18217?focusedWorklogId=789684&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-789684 ]
ASF GitHub Bot logged work on HADOOP-18217:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 11/Jul/22 17:46
            Start Date: 11/Jul/22 17:46
    Worklog Time Spent: 10m
      Work Description: HerCath commented on code in PR #4255:
URL: https://github.com/apache/hadoop/pull/4255#discussion_r918200036


##########
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestExitUtil.java:
##########
@@ -0,0 +1,122 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.util;
+
+import static org.apache.hadoop.test.LambdaTestUtils.intercept;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertNull;
+import static org.junit.Assert.assertSame;
+import static org.junit.Assert.assertTrue;
+
+import org.junit.Test;
+
+import org.apache.hadoop.util.ExitUtil.ExitException;
+import org.apache.hadoop.util.ExitUtil.HaltException;
+
+
+public class TestExitUtil {
+
+  @Test
+  public void testGetSetExitExceptions() throws Throwable {
+    // prepare states and exceptions
+    ExitUtil.disableSystemExit();
+    ExitUtil.resetFirstExitException();
+    ExitException ee1 = new ExitException(1, "TestExitUtil forged 1st ExitException");
+    ExitException ee2 = new ExitException(2, "TestExitUtil forged 2nd ExitException");
+    try {
+      // check proper initial settings
+      assertFalse("ExitUtil.terminateCalled initial value should be false",
+          ExitUtil.terminateCalled());
+      assertNull("ExitUtil.getFirstExitException initial value should be null",
+          ExitUtil.getFirstExitException());
+
+      // simulate/check 1st call
+      ExitException ee = intercept(ExitException.class, () -> ExitUtil.terminate(ee1));
+      assertSame("ExitUtil.terminate should have rethrown its ExitException argument but it "
+          + "had thrown something else", ee1, ee);
+      assertTrue("ExitUtil.terminateCalled should be true after 1st ExitUtil.terminate call",
+          ExitUtil.terminateCalled());
+      assertSame("ExitUtil.terminate should store its 1st call's ExitException",
+          ee1, ExitUtil.getFirstExitException());
+
+      // simulate/check 2nd call not overwriting the 1st one
+      ee = intercept(ExitException.class, () -> ExitUtil.terminate(ee2));
+      assertSame("ExitUtil.terminate should have rethrown its ExitException argument but it "
+          + "had thrown something else", ee2, ee);
+      assertTrue("ExitUtil.terminateCalled should still be true after 2nd ExitUtil.terminate call",
+          ExitUtil.terminateCalled());
+      // 2nd call rethrew the 2nd ExitException, yet only the 1st one should have been stored
+      assertSame("ExitUtil.terminate when called twice should only remember 1st call's "
+          + "ExitException", ee1, ExitUtil.getFirstExitException());
+
+      // simulate cleanup, also tries to make sure state is ok for whatever junit still has to do
+      ExitUtil.resetFirstExitException();
+      assertFalse("ExitUtil.terminateCalled should be false after "
+          + "ExitUtil.resetFirstExitException call", ExitUtil.terminateCalled());
+      assertNull("ExitUtil.getFirstExitException should be null after "
+          + "ExitUtil.resetFirstExitException call", ExitUtil.getFirstExitException());
+    } finally {
+      // cleanup
+      ExitUtil.resetFirstExitException();

Review Comment:
   Ok, I'll then also add a beforeClass as its counterpart. Yes, there is no API to re-enable. Maybe, to limit access, I can add a package-protected enableSystemExit and enableSystemHalt so the test class can use them but not really everyone. What do you think?
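A minimal sketch of what that proposal could look like, assuming the package-private enableSystemExit()/enableSystemHalt() methods suggested in the comment were added to ExitUtil; they are hypothetical here and do not exist in the current API:

    // Sketch of the review proposal above. enableSystemExit() and
    // enableSystemHalt() are hypothetical package-private additions to
    // ExitUtil; only disableSystemExit()/disableSystemHalt() and
    // resetFirstExitException() exist today.
    package org.apache.hadoop.util;

    import org.junit.AfterClass;
    import org.junit.BeforeClass;

    public class TestExitUtil {

      @BeforeClass
      public static void disableSystemExitAndHalt() {
        // keep the JVM alive no matter what the tests in this class trigger
        ExitUtil.disableSystemExit();
        ExitUtil.disableSystemHalt();
      }

      @AfterClass
      public static void restoreSystemExitAndHalt() {
        // package-private re-enabling: visible to tests in this package only
        ExitUtil.enableSystemExit();   // hypothetical counterpart of disableSystemExit()
        ExitUtil.enableSystemHalt();   // hypothetical counterpart of disableSystemHalt()
        ExitUtil.resetFirstExitException();
      }

      // ... test methods as in the diff above ...
    }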
+ assertFalse("ExitUtil.terminateCalled should be false after " + + "ExitUtil.resetFirstExitException call", ExitUtil.terminateCalled()); + assertNull("ExitUtil.getFirstExitException should be null after " + + "ExitUtil.resetFirstExitException call", ExitUtil.getFirstExitException()); + } finally { + // cleanup + ExitUtil.resetFirstExitException(); Review Comment: Ok, i'll then also had a beforeClass as its counter part. Yes there is no API to re-enable. Maybe, to limit access i can add a package protected enableSystemExit and enableSystemHalt so the test class can use them but not really everyone. what do you think ? Issue Time Tracking ------------------- Worklog Id: (was: 789684) Time Spent: 2h 50m (was: 2h 40m) > shutdownhookmanager should not be multithreaded (deadlock possible) > ------------------------------------------------------------------- > > Key: HADOOP-18217 > URL: https://issues.apache.org/jira/browse/HADOOP-18217 > Project: Hadoop Common > Issue Type: Bug > Components: util > Affects Versions: 2.10.1 > Environment: linux, windows, any version > Reporter: Catherinot Remi > Priority: Minor > Labels: pull-request-available > Attachments: wtf.java > > Time Spent: 2h 50m > Remaining Estimate: 0h > > the ShutdownHookManager class uses an executor to run hooks to have a > "timeout" notion around them. It does this using a single threaded executor. > It can leads to deadlock leaving a never-shutting-down JVM with this > execution flow: > * JVM need to exit (only daemon threads remaining or someone called > System.exit) > * ShutdowHookManager kicks in > * SHMngr executor start running some hooks > * SHMngr executor thread kicks in and, as a side effect, run some code from > one of the hook that calls System.exit (as a side effect from an external lib > for example) > * the executor thread is waiting for a lock because another thread already > entered System.exit and has its internal lock, so the executor never returns. > * SHMngr never returns > * 1st call to System.exit never returns > * JVM stuck > > using an executor with a single thread does "fake" timeouts (the task keeps > running, you can interrupt it but until it stumble upon some piece of code > that is interruptible (like an IO) it will keep running) especially since the > executor is a single threaded one. So it has this bug for example : > * caller submit 1st hook (bad one that would need 1 hour of runtime and that > cannot be interrupted) > * executor start 1st hook > * caller of the future 1st hook result timeout > * caller submit 2nd hook > * bug : 1 hook still running, 2nd hook triggers a timeout but never got the > chance to run anyway, so 1st faulty hook makes it impossible for any other > hook to have a chance to run, so running hooks in a single separate thread > does not allow to run other hooks in parallel to long ones. > > If we really really want to timeout the JVM shutdown, even accepting maybe > dirty shutdown, it should rather handle the hooks inside the initial thread > (not spawning new one(s) so not triggering the deadlock described on the 1st > place) and if a timeout was configured, only spawn a single parallel daemon > thread that sleeps the timeout delay, and then use Runtime.halt (which bypass > the hook system so should not trigger the deadlock). If the normal > System.exit ends before the timeout delay everything is fine. 
> If we really, really want to timeout the JVM shutdown, even accepting a
> possibly dirty shutdown, it should rather run the hooks inside the initial
> thread (not spawning new one(s), so not triggering the deadlock described in
> the 1st place) and, if a timeout was configured, only spawn a single parallel
> daemon thread that sleeps for the timeout delay and then calls Runtime.halt
> (which bypasses the hook system, so should not trigger the deadlock). If the
> normal System.exit ends before the timeout delay, everything is fine. If the
> System.exit took too much time, the JVM is killed, and so the reason why this
> multithreaded shutdown hook implementation was created is satisfied (avoiding
> hanging JVMs).
>
> Had the bug with both Oracle and OpenJDK builds, all in the 1.8 major
> version. Hadoop 2.6 and 2.7 did not have the issue because they do not run
> hooks in another thread.
>
> Another solution is of course to configure the timeout AND to have as many
> threads as needed to run the hooks, so as to have at least some gain to
> offset the pain of the deadlock scenario.
>
> EDIT: added some logs and reproduced the problem. In fact it is located after
> triggering all the hook entries and before shutting down the executor. The
> current code, after running the hooks, creates a new Configuration object,
> reads the configured timeout from it, and applies this timeout to shut down
> the executor. I sometimes run with a classloader doing remote classloading;
> Configuration loads its content using this classloader, so when shutting down
> the JVM and some network error occurs, the classloader fails to load the
> resources needed by Configuration. So the code crashes before shutting down
> the executor and ends up inside the thread's default uncaught throwable
> handler, which was calling System.exit, so it got stuck, so shutting down the
> executor never returned, and neither does the JVM.
> So, forget about the halt stuff (even if it is a last-resort, very robust
> safety net). Still, I'll make a small adjustment to the final executor
> shutdown code to be slightly more robust to even the strangest
> exceptions/errors it encounters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
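The "small adjustment" mentioned in the EDIT of the quoted description is not spelled out in this message. Purely as an illustration of the idea, and not the actual ShutdownHookManager code or patch, a defensive version of that final executor shutdown could look roughly like this:

    // Illustration only. The point from the EDIT above: nothing thrown while
    // shutting down the hook executor (e.g. a Configuration that cannot load
    // its resources over a broken classloader) may escape to the default
    // uncaught exception handler, where it would re-enter System.exit and
    // deadlock again.
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.TimeUnit;

    final class QuietExecutorShutdown {
      static void shutdownQuietly(ExecutorService executor, long timeoutMs) {
        try {
          executor.shutdown();
          if (!executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
            executor.shutdownNow();
          }
        } catch (Throwable t) {      // deliberately broad: the JVM is already exiting
          try {
            executor.shutdownNow();
          } catch (Throwable ignored) {
            // last resort: swallow everything rather than re-trigger System.exit
          }
        }
      }
    }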