[ https://issues.apache.org/jira/browse/SAMZA-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320579#comment-14320579 ]
Yan Fang commented on SAMZA-507:
--------------------------------
Patch looks good. +1. Thank you.
> OOME causes container to wedge
> ------------------------------
>
> Key: SAMZA-507
> URL: https://issues.apache.org/jira/browse/SAMZA-507
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.8.0
> Reporter: Chris Riccomini
> Assignee: Chris Riccomini
> Fix For: 0.9.0
>
> Attachments: SAMZA-507-0.patch
>
>
> One of our Samza jobs' containers wedged the other day due to an OOME. We had
> some downtime in an upstream cluster, which caused data to build up. The
> burst in throughput of messages caused one of our containers to OOME (the
> container was buffering input messages in memory, and the buffer grew very
> large very quickly). The OOME occurred on the BrokerProxy thread. Once this
> happened, no new messages were consumed, and the rest of the container just
> sat idle. Only one of the 64 containers in the job failed this way.
> A good first step in fixing this problem would be to have the container
> recognize that an OOME has occurred, and kill itself with a non-zero exit
> code. This would cause the AM to restart it, which would in turn un-wedge the
> process.
> There are longer-term solutions that would help prevent this specific issue,
> but in general, an OOME should never be able to leave a process wedged.
> I poked around a bit. The problem is that an OOME on the BrokerProxy thread
> (or any thread) [just causes that thread to
> die|http://stackoverflow.com/questions/10327989/does-jvm-terminate-itself-after-outofmemoryerror].
> The rest of the process continues running. There seem to be two solutions to
> this problem, if you wish your process to die when an OOME occurs:
> # Use
> [-XX:OnOutOfMemoryError|http://stackoverflow.com/questions/3871278/how-do-i-make-the-jvm-exit-on-any-outofmemoryexception-even-when-bad-people-try]
> to execute a kill -9 on the process.
> # Use
> [Thread.setDefaultUncaughtExceptionHandler|http://www.javamex.com/tutorials/exceptions/exceptions_uncaught_handler.shtml]
> to catch OOME from all threads, and execute a System.exit().
> (2) seems more appealing to me. Is anyone else aware of any other ways to
> handle this?
> The proposed behavior is that no thread should ever have an uncaught
> exception. If one occurs, the entire container should exit.
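For reference, option (2) above could look roughly like the sketch below. This is not the code in SAMZA-507-0.patch; the class name ExitOnUncaught, the EXIT_CODE value, and the injectable exit callback are all hypothetical and exist only to keep the example small and testable. It assumes that exiting with any non-zero code is enough for the AM to restart the container.

```java
import java.util.function.IntConsumer;

// Hypothetical helper, not part of Samza: installs a JVM-wide handler so that
// an uncaught Throwable (including OutOfMemoryError) on any thread ends the
// whole process instead of silently killing just that thread.
final class ExitOnUncaught {
    // Assumed exit code; any non-zero value marks the container as failed.
    static final int EXIT_CODE = 1;

    private ExitOnUncaught() {}

    // The exit action is passed in (normally System::exit) so the handler
    // can be exercised in a test without terminating the JVM.
    static void install(IntConsumer exit) {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            try {
                // Logging may itself fail when memory is exhausted, so the
                // exit call sits in a finally block.
                System.err.println("Uncaught " + error + " on thread "
                        + thread.getName() + "; exiting");
            } finally {
                exit.accept(EXIT_CODE);
            }
        });
    }
}
```

Calling ExitOnUncaught.install(System::exit) once at container startup, before any worker threads are created, would cover threads such as BrokerProxy. A thread with its own handler set via Thread.setUncaughtExceptionHandler bypasses the default one, so this only helps where no per-thread handler is installed.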