[ 
https://issues.apache.org/jira/browse/FLINK-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364094#comment-15364094
 ] 

ASF GitHub Bot commented on FLINK-4137:
---------------------------------------

GitHub user rmetzger opened a pull request:

    https://github.com/apache/flink/pull/2205

    [FLINK-4137] Tell Akka to shut down the JVM on fatal errors

    Currently, the ProcessReaper might not receive the `Terminated` message
    when a fatal error such as an OutOfMemoryError leaves the actor system in
    an inconsistent state.

    Changing the default value of the configuration makes Akka shut down the
    JVM on such errors.
    
    See also the JIRA discussion.
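
The PR text does not name the setting it changes; Akka's `akka.jvm-exit-on-fatal-error` option is the likely candidate, but that is an assumption here. Independent of Akka, the general idea of "exit the JVM on a fatal error" can be sketched in plain Java as below. The class name `FatalErrorExit`, the exit code, and the rule that only `VirtualMachineError` counts as fatal are all hypothetical simplifications for illustration, not what this PR or Akka implements.

```java
import java.util.function.IntConsumer;

/** Illustrative sketch only: exit the JVM when an uncaught fatal error is seen. */
class FatalErrorExit {
    /** Arbitrary exit code chosen for this sketch. */
    static final int FATAL_EXIT_CODE = 255;

    /**
     * Simplified notion of "fatal": JVM-level errors such as OutOfMemoryError
     * or StackOverflowError, which both extend VirtualMachineError.
     */
    static boolean isFatal(Throwable t) {
        return t instanceof VirtualMachineError;
    }

    /**
     * Builds a handler that invokes the given exit action for fatal errors and
     * ignores everything else. The exit action is a parameter so the handler
     * can be exercised without actually terminating the process.
     */
    static Thread.UncaughtExceptionHandler handler(IntConsumer exitAction) {
        return (thread, error) -> {
            if (isFatal(error)) {
                exitAction.accept(FATAL_EXIT_CODE);
            }
        };
    }

    /** Installs the handler JVM-wide, using Runtime.halt as the exit action. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler(
                handler(code -> Runtime.getRuntime().halt(code)));
    }
}
```

`Runtime.halt` is used rather than `System.exit` because it skips shutdown hooks, which may themselves hang when the heap is exhausted. Whether that trade-off is acceptable depends on the application.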

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rmetzger/flink flink4137

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2205.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2205
    
----
commit da818d9eddd5d35e1f8892f68deae1af7d1c36ac
Author: Robert Metzger <[email protected]>
Date:   2016-07-06T10:13:21Z

    [FLINK-4137] Tell Akka to shut down the JVM on fatal errors

----


> JobManager web frontend does not shut down on OOM exception on JM
> -----------------------------------------------------------------
>
>                 Key: FLINK-4137
>                 URL: https://issues.apache.org/jira/browse/FLINK-4137
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, JobManager, Webfrontend
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>            Priority: Critical
>
> After the following exception on the JobManager:
> {code}
> 2016-06-30 14:45:06,642 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 379 (in 7017 ms)
> 2016-06-30 14:45:06,642 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 380 @ 1467297906642
> 2016-06-30 14:45:17,902 ERROR akka.actor.ActorSystemImpl                                    - Uncaught fatal error from thread [flink-akka.remote.default-remote-dispatcher-6] shutting down ActorSystem [flink]
> java.lang.OutOfMemoryError: Java heap space
>       at com.google.protobuf.ByteString.copyFrom(ByteString.java:192)
>       at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:324)
>       at akka.remote.WireFormats$SerializedMessage.<init>(WireFormats.java:3030)
>       at akka.remote.WireFormats$SerializedMessage.<init>(WireFormats.java:2980)
>       at akka.remote.WireFormats$SerializedMessage$1.parsePartialFrom(WireFormats.java:3073)
>       at akka.remote.WireFormats$SerializedMessage$1.parsePartialFrom(WireFormats.java:3068)
>       at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>       at akka.remote.WireFormats$RemoteEnvelope.<init>(WireFormats.java:993)
>       at akka.remote.WireFormats$RemoteEnvelope.<init>(WireFormats.java:927)
>       at akka.remote.WireFormats$RemoteEnvelope$1.parsePartialFrom(WireFormats.java:1049)
>       at akka.remote.WireFormats$RemoteEnvelope$1.parsePartialFrom(WireFormats.java:1044)
>       at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer.<init>(WireFormats.java:241)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer.<init>(WireFormats.java:175)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer$1.parsePartialFrom(WireFormats.java:279)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer$1.parsePartialFrom(WireFormats.java:274)
>       at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:141)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:176)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:193)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer.parseFrom(WireFormats.java:409)
>       at akka.remote.transport.AkkaPduProtobufCodec$.decodeMessage(AkkaPduCodec.scala:181)
>       at akka.remote.EndpointReader.akka$remote$EndpointReader$$tryDecodeMessageAndAck(Endpoint.scala:995)
>       at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:928)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>       at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> 2016-06-30 14:45:18,502 INFO  org.apache.flink.yarn.YarnJobManager                          - Stopping JobManager akka.tcp://[email protected]:45569/user/jobmanager.
> 2016-06-30 14:45:18,533 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Custom File Source (1/1) (5f2a1062c796ec6098a0a88227b9eab4) switched from RUNNING to CANCELING
> {code}
> The JobManager JVM keeps running (keeping the YARN session alive) because the 
> web monitor is not stopped on such errors.
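
A JVM exits on its own only once every non-daemon thread has finished, so a component that is not stopped can keep the process alive after the actor system is gone. A minimal diagnostic sketch, assuming (the issue does not state this) that the web monitor's server threads are non-daemon, is to list the live non-daemon threads. The class name `NonDaemonThreads` is hypothetical and not part of Flink.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only: find the threads that keep a JVM from exiting. */
class NonDaemonThreads {
    /** Returns the sorted names of all live threads that are not daemon threads. */
    static List<String> liveNonDaemonThreadNames() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && !t.isDaemon())
                .map(Thread::getName)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Calling `NonDaemonThreads.liveNonDaemonThreadNames()` in a process that refuses to exit shows which threads still need to be stopped; a thread dump taken with `jstack` gives the same information from outside the process.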



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
