[ 
https://issues.apache.org/jira/browse/FLINK-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364082#comment-15364082
 ] 

Robert Metzger commented on FLINK-4137:
---------------------------------------

Thanks a lot for the good analysis of the issue, [~till.rohrmann]. I set the 
configuration parameter 'akka.jvm-exit-on-fatal-error' to "on", and the 
JobManager then shut down properly:

{code}
2016-07-06 10:04:02,727 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 14 (in 1542 ms)
2016-07-06 10:04:06,093 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 15 @ 1467799446093
2016-07-06 10:04:07,832 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 15 (in 1646 ms)
2016-07-06 10:04:11,092 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 16 @ 1467799451092
2016-07-06 10:04:13,140 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 16 (in 1940 ms)
2016-07-06 10:04:16,092 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 17 @ 1467799456092
2016-07-06 10:04:19,859 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 17 (in 3520 ms)
2016-07-06 10:04:21,092 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 18 @ 1467799461092
2016-07-06 10:04:26,495 ERROR akka.actor.ActorSystemImpl                                    - Uncaught error from thread [flink-akka.remote.default-remote-dispatcher-5] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled
java.lang.OutOfMemoryError: Java heap space
        at java.lang.reflect.Array.newArray(Native Method)
        at java.lang.reflect.Array.newInstance(Array.java:70)
        at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1670)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
        at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
        at scala.util.Try$.apply(Try.scala:161)
        at akka.serialization.Serialization.deserialize(Serialization.scala:98)
        at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
        at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
        at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
        at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
        at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
        at akka.dispatch.Mailbox.run(Mailbox.scala:221)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2016-07-06 10:04:26,496 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping checkpoint coordinator for job e039626ba2dad012ef06e13ab84741fc
2016-07-06 10:04:26,495 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping checkpoint coordinator for job e039626ba2dad012ef06e13ab84741fc
2016-07-06 10:04:26,498 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Removing web dashboard root cache directory /tmp/flink-web-c735e6c2-a9eb-4a56-91e3-215d3e8821b1
2016-07-06 10:04:26,608 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Removing web dashboard jar upload directory /tmp/flink-web-upload-1cb8d9b5-9b92-4498-9ab4-a993e72827c7
2016-07-06 10:04:26,784 INFO  org.apache.flink.runtime.blob.BlobServer                      - Stopped BLOB server at 0.0.0.0:44754
2016-07-06 10:04:26,791 INFO  org.apache.flink.yarn.YarnJobManager                          - Received message for non-existing checkpoint 18
2016-07-06 10:04:26,791 INFO  org.apache.flink.yarn.YarnJobManager                          - Received message for non-existing checkpoint 18
{code}
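
For readers who want to see why the setting matters, here is a minimal, 
self-contained Java sketch of the mechanism as I understand it. It is not 
Flink or Akka code: the class FatalErrorSketch, its methods, and the thread 
names are all hypothetical. A non-daemon thread stands in for the web monitor, 
and a second thread throws a simulated OutOfMemoryError. Only when the flag is 
"on" does the process exit right away; otherwise the surviving non-daemon 
thread keeps the JVM alive.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration; none of these names come from Flink or Akka. */
class FatalErrorSketch {

    /** A live non-daemon thread prevents the JVM from exiting on its own. */
    static boolean keepsJvmAlive(Thread t) {
        return t.isAlive() && !t.isDaemon();
    }

    /** Assumption for this sketch: JVM-level errors such as OutOfMemoryError count as fatal. */
    static boolean isFatal(Throwable t) {
        return t instanceof VirtualMachineError;
    }

    /** Runs the task on its own thread and returns whatever it left uncaught (or null). */
    static Throwable runAndCapture(Runnable task) {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        Thread worker = new Thread(task, "dispatcher-sketch");
        worker.setUncaughtExceptionHandler((thread, error) -> seen.set(error));
        worker.start();
        try {
            worker.join(); // the handler has run once the thread has terminated
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        // Stand-in for the configuration flag discussed in this issue.
        boolean exitOnFatalError = args.length > 0 && args[0].equals("on");

        // Stand-in for the web monitor: a non-daemon thread that just waits.
        CountDownLatch stop = new CountDownLatch(1);
        Thread webMonitor = new Thread(() -> {
            try {
                stop.await();
            } catch (InterruptedException ignored) {
                // fall through and let the thread end
            }
        }, "web-monitor-sketch");
        webMonitor.setDaemon(false);
        webMonitor.start();

        // Simulated fatal error; no real memory exhaustion takes place.
        Throwable error = runAndCapture(() -> {
            throw new OutOfMemoryError("simulated");
        });

        if (isFatal(error) && exitOnFatalError) {
            System.exit(-1); // terminates the JVM even though webMonitor is still alive
        }

        System.out.println("non-daemon thread keeps JVM alive: " + keepsJvmAlive(webMonitor));
        // Released only so that the demo ends; a real server thread would keep running.
        stop.countDown();
    }
}
```

Running it with the argument "on" exits immediately after the simulated 
error. Running it without arguments reaches the final print statement, which 
corresponds to the state described in the issue below: the error has been 
handled, but a non-daemon thread is still alive.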

> JobManager web frontend does not shut down on OOM exception on JM
> -----------------------------------------------------------------
>
>                 Key: FLINK-4137
>                 URL: https://issues.apache.org/jira/browse/FLINK-4137
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, JobManager, Webfrontend
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>            Priority: Critical
>
> The following exception occurred on the JobManager:
> {code}
> 2016-06-30 14:45:06,642 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 379 (in 7017 ms)
> 2016-06-30 14:45:06,642 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 380 @ 1467297906642
> 2016-06-30 14:45:17,902 ERROR akka.actor.ActorSystemImpl                                    - Uncaught fatal error from thread [flink-akka.remote.default-remote-dispatcher-6] shutting down ActorSystem [flink]
> java.lang.OutOfMemoryError: Java heap space
>       at com.google.protobuf.ByteString.copyFrom(ByteString.java:192)
>       at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:324)
>       at akka.remote.WireFormats$SerializedMessage.<init>(WireFormats.java:3030)
>       at akka.remote.WireFormats$SerializedMessage.<init>(WireFormats.java:2980)
>       at akka.remote.WireFormats$SerializedMessage$1.parsePartialFrom(WireFormats.java:3073)
>       at akka.remote.WireFormats$SerializedMessage$1.parsePartialFrom(WireFormats.java:3068)
>       at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>       at akka.remote.WireFormats$RemoteEnvelope.<init>(WireFormats.java:993)
>       at akka.remote.WireFormats$RemoteEnvelope.<init>(WireFormats.java:927)
>       at akka.remote.WireFormats$RemoteEnvelope$1.parsePartialFrom(WireFormats.java:1049)
>       at akka.remote.WireFormats$RemoteEnvelope$1.parsePartialFrom(WireFormats.java:1044)
>       at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer.<init>(WireFormats.java:241)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer.<init>(WireFormats.java:175)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer$1.parsePartialFrom(WireFormats.java:279)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer$1.parsePartialFrom(WireFormats.java:274)
>       at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:141)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:176)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:188)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:193)
>       at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>       at akka.remote.WireFormats$AckAndEnvelopeContainer.parseFrom(WireFormats.java:409)
>       at akka.remote.transport.AkkaPduProtobufCodec$.decodeMessage(AkkaPduCodec.scala:181)
>       at akka.remote.EndpointReader.akka$remote$EndpointReader$$tryDecodeMessageAndAck(Endpoint.scala:995)
>       at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:928)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>       at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> 2016-06-30 14:45:18,502 INFO  org.apache.flink.yarn.YarnJobManager                          - Stopping JobManager akka.tcp://[email protected]:45569/user/jobmanager.
> 2016-06-30 14:45:18,533 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Custom File Source (1/1) (5f2a1062c796ec6098a0a88227b9eab4) switched from RUNNING to CANCELING
> {code}
> The JobManager JVM keeps running (keeping the YARN session alive) because the 
> web monitor is not stopped on such errors.



