[jira] [Updated] (DRILL-6468) CatastrophicFailure.exit Should Not Call System.exit

Timothy Farkas (JIRA) Tue, 05 Jun 2018 13:48:22 -0700


     [ 
https://issues.apache.org/jira/browse/DRILL-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Timothy Farkas updated DRILL-6468:
----------------------------------
    Description: 
Drill may never terminate in the event of a Heap OOM. When this happens we see 
stack traces like the following:

{code}
"250387a7-363d-619c-d745-57ae50f19d15:frag:0:0" #104 daemon prio=10 os_prio=0 tid=0x00007fd9d1eec190 nid=0xd7d5 in Object.wait() [0x00007fd953de2000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0x00000005c06bee28> (a org.apache.drill.exec.server.Drillbit$ShutdownThread)
        at java.lang.Thread.join(Thread.java:1326)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x00000005c1d8bb28> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:971)
        at org.apache.drill.common.CatastrophicFailure.exit(CatastrophicFailure.java:49)
        at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:246)
        at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
{code}

Here CatastrophicFailure.exit is called when we encounter a Heap OOM, and it 
in turn calls System.exit to terminate the Java process. The problem is that 
System.exit runs Drill's normal shutdown hook and attempts a graceful 
shutdown. In the trace above, the fragment thread is blocked in Thread.join 
waiting for Drillbit$ShutdownThread to finish. In the case of a Heap OOM we 
cannot shut down gracefully in a reliable way because there isn't enough 
memory to continue executing our code. The JVM likely gets stuck at various 
places waiting on garbage collection and object allocations on the heap, and 
the Drillbit stops making progress.

*Solution To Hanging Shutdown*

There are two kinds of OutOfMemory exceptions in Drill: Direct Memory OOMs and 
Heap OOMs. Direct Memory OOMs are typically recoverable because Drill uses 
Direct Memory only to store data, so we can fail the query, discard its data, 
and recover. Heap OOMs are unrecoverable because we need the heap to execute 
our code; if we can't use the heap, we can't run our code reliably.
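
The distinction above can be sketched in plain Java. This is only an 
illustration, not Drill's code: the class name OomKindSketch, the Kind enum, 
and the classify method are hypothetical, and it assumes the JDK's own 
OutOfMemoryError messages ("Java heap space", "Direct buffer memory") rather 
than whatever exception types Drill's allocator actually uses for Direct 
Memory OOMs.

```java
final class OomKindSketch {
  /** Hypothetical categories, named after the two OOM kinds in the text. */
  enum Kind { HEAP, DIRECT, NOT_OOM }

  private static final int MAX_CAUSE_DEPTH = 32;

  private OomKindSketch() {}

  /**
   * Walks the cause chain looking for an OutOfMemoryError.
   * Assumption: a direct-memory failure carries the JDK message
   * "Direct buffer memory"; every other OutOfMemoryError is treated
   * as a heap failure, which is the conservative choice.
   */
  static Kind classify(Throwable t) {
    Throwable cur = t;
    // The depth cap guards against unusually long or cyclic cause chains.
    for (int depth = 0; cur != null && depth < MAX_CAUSE_DEPTH; depth++) {
      if (cur instanceof OutOfMemoryError) {
        String msg = cur.getMessage();
        boolean direct = msg != null && msg.contains("Direct buffer memory");
        return direct ? Kind.DIRECT : Kind.HEAP;
      }
      cur = cur.getCause();
    }
    return Kind.NOT_OOM;
  }
}
```

Treating every unrecognized OutOfMemoryError as HEAP errs on the side of 
terminating the process rather than continuing in an unreliable state.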

When Drill experiences a catastrophic failure we should not call System.exit, 
because that triggers a graceful shutdown. A catastrophic failure like a Heap 
OOM is not recoverable, so we should forcefully terminate the JVM with 
Runtime.getRuntime().halt, which does not run shutdown hooks.

This will make Drill shut down promptly in the event of a Heap OOM.
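
A minimal sketch of the proposed change, assuming a helper shaped roughly like 
CatastrophicFailure.exit. The class name CatastrophicFailureSketch, the 
parameter list, and the injectable IntConsumer are hypothetical and exist only 
to make the idea concrete and testable; this is not the actual patch.

```java
import java.util.function.IntConsumer;

final class CatastrophicFailureSketch {
  private CatastrophicFailureSketch() {}

  /** Halts the JVM immediately; shutdown hooks are NOT run. */
  static void exit(Throwable cause, String message, int exitCode) {
    exit(cause, message, exitCode, Runtime.getRuntime()::halt);
  }

  /**
   * Same logic with the terminating action passed in (an assumption made
   * for testability), so a test can record the exit code instead of
   * killing the JVM.
   */
  static void exit(Throwable cause, String message, int exitCode,
                   IntConsumer terminator) {
    try {
      // Best-effort report; this allocates, so it may fail under heap exhaustion.
      System.err.println("Catastrophic failure: " + message + " (" + cause + ")");
    } catch (Throwable ignored) {
      // Reporting failed, most likely due to the OOM itself; terminate anyway.
    } finally {
      // With the default terminator this is Runtime.halt, unlike System.exit,
      // which would block on Drill's shutdown hook as in the trace above.
      terminator.accept(exitCode);
    }
  }
}
```

Placing the terminating call in a finally block means a failure while 
reporting the error cannot prevent the process from being halted.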


> CatastrophicFailure.exit Should Not Call System.exit
> ----------------------------------------------------
>
>                 Key: DRILL-6468
>                 URL: https://issues.apache.org/jira/browse/DRILL-6468
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Timothy Farkas
>            Assignee: Timothy Farkas
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)