[ 
https://issues.apache.org/jira/browse/DRILL-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503699#comment-16503699
 ] 

ASF GitHub Bot commented on DRILL-6468:
---------------------------------------

ilooner commented on a change in pull request #1306: DRILL-6468: 
CatastrophicFailures should forcibly terminate the JVM, not do graceful 
shutdown.
URL: https://github.com/apache/drill/pull/1306#discussion_r193510810
 
 

 ##########
 File path: 
common/src/main/java/org/apache/drill/common/CatastrophicFailure.java
 ##########
 @@ -46,6 +46,7 @@ public static void exit(Throwable e, String message, int 
code) {
       Thread.sleep(1000);
     } catch (InterruptedException e2) {
     }
-    System.exit(code);
+
+    Runtime.getRuntime().halt(code);
 
 Review comment:
   I think this boils down to what you consider a Heap OOM. In my mind if we 
experience a heap OOM we cannot continue executing any code safely, this 
includes any shutdown sequences. The reason why it is not safe is because the 
JVM can get stuck in gc during a memory allocation. So if we run any shutdown 
code we can hang, and that was the issue we were trying to fix in the first 
place. Additionally I'm not totally convinced we would want to bypass Drill's 
close sequence but run the Shutdown hooks for the libraries that Drill uses. 
That could lead to a dangerous situation where Drill continues running when 
some underlying resource are defunct and we can get even more errors and hangs.
   
   At the end of the day though it's just your word against mine. So let's do 
an experiment. I'll go your route and add a flag to bypass only Drill's 
ShutdownThread in the event of a CatastrophicFailure. If the issue of hanging 
shutdowns never occurs again for users, great. If it continues to occur for 
users after this change, then we should agree to make the change to forcibly 
terminate the JVM with halt.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> CatastrophicFailure.exit Should Not Call System.exit
> ----------------------------------------------------
>
>                 Key: DRILL-6468
>                 URL: https://issues.apache.org/jira/browse/DRILL-6468
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Timothy Farkas
>            Assignee: Timothy Farkas
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Drill may never terminate in the event of a Heap OOM. When this happens we 
> see stack traces like the following:
> {code}
> "250387a7-363d-619c-d745-57ae50f19d15:frag:0:0" #104 daemon prio=10 os_prio=0 
> tid=0x00007fd9d1eec190 nid=0xd7d5 in Object.wait() [0x00007fd953de2000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1252)
>         - locked <0x00000005c06bee28> (a 
> org.apache.drill.exec.server.Drillbit$ShutdownThread)
>         at java.lang.Thread.join(Thread.java:1326)
>         at 
> java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
>         at 
> java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
>         at java.lang.Shutdown.runHooks(Shutdown.java:123)
>         at java.lang.Shutdown.sequence(Shutdown.java:167)
>         at java.lang.Shutdown.exit(Shutdown.java:212)
>         - locked <0x00000005c1d8bb28> (a java.lang.Class for 
> java.lang.Shutdown)
>         at java.lang.Runtime.exit(Runtime.java:109)
>         at java.lang.System.exit(System.java:971)
>         at 
> org.apache.drill.common.CatastrophicFailure.exit(CatastrophicFailure.java:49)
>         at 
> org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:246)
>         at 
> org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> Here CatastrophicFailure.exit is being called when we encounter a Heap OOM. 
> Then we call System.exit to terminate the java process. The only issue is 
> that System.exit run's Drill's normal shutdown hook and tries to do a 
> graceful shutdown. In the case of a Heap OOM we cannot do this reliable 
> because there physically isn't enough memory to proceed executing our code. 
> The JVM likely gets stuck a various places waiting on garbage collection and 
> object allocations on the heap and the Drillbit stops making progress.
> *Solution To Hanging Shutdown*
> There are two kinds of OutOfMemory exceptions in Drill. Direct Memory OOMs 
> and Heap OOMs. Typically Direct Memory OOMs are recoverable because Drill 
> uses Direct Memory to store data only, so we can fail a query and lose data 
> and recover. Heap OOMs are unrecoverable because we actually need the Heap to 
> execute our code, and if we can't use the heap then we basically can't run 
> our code reliably.
> When Drill experiences a catastrophic failure we should not call System.exit 
> because then we will try to shutdown gracefully. In the event of a 
> catastrophic failure like a Heap OOM we cannot recover so we should 
> forcefully terminate the jvm with Runtime.getRuntime().halt .
> This will make Drill shutdown promptly in the event of a Heap OOM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to