[ 
https://issues.apache.org/jira/browse/SAMZA-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320579#comment-14320579
 ] 

Yan Fang commented on SAMZA-507:
--------------------------------

Patch looks good. +1. Thank you.

> OOME causes container to wedge
> ------------------------------
>
>                 Key: SAMZA-507
>                 URL: https://issues.apache.org/jira/browse/SAMZA-507
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.8.0
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>             Fix For: 0.9.0
>
>         Attachments: SAMZA-507-0.patch
>
>
> One of our Samza jobs' containers wedged the other day due to an OOME. We had 
> some downtime in an upstream cluster, which caused data to build up. The 
> burst in throughput of messages caused one of our containers to OOME (the 
> container was buffering input messages in memory, and the buffer grew very 
> large very quickly). The OOME occurred on the BrokerProxy thread. Once this 
> happened, no new messages were consumed, and the rest of the container just 
> sat idle. Only one of the 64 containers in the job failed this way.
> A good first step in fixing this problem would be to have the container 
> recognize that an OOME has occurred, and kill itself with a non-zero exit 
> code. This would cause the AM to restart it, which would in turn un-wedge the 
> process.
> There are longer-term solutions that would help prevent this specific issue, 
> but in general an OOME should never be able to leave a process wedged.
> I poked around a bit. The problem is that an OOME on the BrokerProxy thread 
> (or any thread) [just causes that thread to 
> die|http://stackoverflow.com/questions/10327989/does-jvm-terminate-itself-after-outofmemoryerror].
>  The rest of the process continues running. There seem to be two solutions to 
> this problem, if you wish your process to die when an OOME occurs:
> # Use 
> [-XX:OnOutOfMemoryError|http://stackoverflow.com/questions/3871278/how-do-i-make-the-jvm-exit-on-any-outofmemoryexception-even-when-bad-people-try]
>  to execute a kill -9 on the process.
> # Use 
> [Thread.setDefaultUncaughtExceptionHandler|http://www.javamex.com/tutorials/exceptions/exceptions_uncaught_handler.shtml]
>  to catch OOME from all threads, and execute a System.exit().
> (2) seems more appealing to me. Is anyone else aware of any other ways to 
> handle this?
> The proposed behavior is that no thread should ever have an uncaught 
> exception. If one occurs, the entire container should exit.
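
For reference, option (2) above could be sketched roughly as follows. This is only an illustration of the Thread.setDefaultUncaughtExceptionHandler approach, not the contents of SAMZA-507-0.patch. The class name ExitOnUncaughtException, the EXIT_CODE value of 1, and the injectable exit callback are all assumptions made for the example.

```java
import java.util.function.IntConsumer;

/**
 * Hypothetical helper (not Samza's actual code): a handler that terminates
 * the process when any thread dies with an uncaught Throwable, including
 * OutOfMemoryError.
 */
final class ExitOnUncaughtException implements Thread.UncaughtExceptionHandler {
    // Assumed value; any non-zero code signals failure to the parent process.
    static final int EXIT_CODE = 1;

    // The exit action is injectable so the handler can be exercised in a
    // test without killing the JVM. In production this would be System::exit.
    private final IntConsumer exiter;

    ExitOnUncaughtException(IntConsumer exiter) {
        this.exiter = exiter;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        try {
            // Logging may itself fail after an OOME, so the exit call is
            // placed in a finally block.
            System.err.println("Uncaught exception in thread "
                    + t.getName() + ": " + e);
        } finally {
            exiter.accept(EXIT_CODE);
        }
    }

    /** Installs the handler as the JVM-wide default for all threads. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(
                new ExitOnUncaughtException(System::exit));
    }
}
```

Calling ExitOnUncaughtException.install() once at container startup would apply the handler to every thread that does not set its own. Note that System.exit() runs shutdown hooks, so a hook that blocks could still delay the exit.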



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
