ouyangzhe created FLINK-10850:
---------------------------------

             Summary: Job may hang on FAILING state if taskmanager updateTaskExecutionState failed
                 Key: FLINK-10850
                 URL: https://issues.apache.org/jira/browse/FLINK-10850
             Project: Flink
          Issue Type: Bug
          Components: JobManager
    Affects Versions: 1.5.5
            Reporter: ouyangzhe
             Fix For: 1.8.0

I encountered a job that ran out of memory (OOM) but then hung in the FAILING state. Three slots were still waiting to be released, and the corresponding tasks were in state CANCELING. I found the following log in the TaskManager. It seems that the TaskManager tried to call updateTaskExecutionState to move the task from CANCELING to CANCELED, but hit an OutOfMemoryError.

{panel}
2018-11-08 18:01:23,250 INFO org.apache.flink.runtime.taskmanager.Task - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from RUNNING to CANCELING.
2018-11-08 18:01:23,257 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
2018-11-08 18:01:44,081 INFO org.apache.flink.runtime.taskmanager.Task - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from CANCELING to CANCELED.
2018-11-08 18:01:44,081 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
2018-11-08 18:02:03,097 WARN org.apache.flink.runtime.taskmanager.Task - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
 org.apache.flink.shaded.guava18.com.google.common.collect.Maps$EntryFunction$1.apply(Maps.java:86)
 org.apache.flink.shaded.guava18.com.google.common.collect.Iterators$8.transform(Iterators.java:799)
 org.apache.flink.shaded.guava18.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
 java.util.AbstractCollection.toArray(AbstractCollection.java:141)
 org.apache.flink.shaded.guava18.com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:258)
 org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartitionsProducedBy(ResultPartitionManager.java:100)
 org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:275)
 org.apache.flink.runtime.taskmanager.Task.run(Task.java:833)
 java.lang.Thread.run(Thread.java:745)
2018-11-08 18:02:05,665 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Discarding the results produced by task execution e9141e20871e530dee904ddce11adca0.
2018-11-08 18:02:22,536 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Discarding the results produced by task execution 7fac76a5d76247d803e1f1c47a6b385f.
2018-11-08 18:03:47,210 WARN org.apache.flink.runtime.taskmanager.Task - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
 org.apache.flink.runtime.memory.MemoryManager.releaseAll(MemoryManager.java:497)
 org.apache.flink.runtime.taskmanager.Task.run(Task.java:837)
 java.lang.Thread.run(Thread.java:745)
2018-11-08 18:03:47,213 INFO org.apache.flink.runtime.taskmanager.Task - Ensuring all FileSystem streams are closed for task PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) [CANCELED]
2018-11-08 18:03:47,215 WARN org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline - An exception was thrown by a user handler while handling an exception event ([id: 0x397132f7, /11.10.199.197:33286 => /11.9.137.228:40859] EXCEPTION: java.lang.OutOfMemoryError: GC overhead limit exceeded)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
    at org.apache.flink.shaded.akka.org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
    at org.apache.flink.shaded.akka.org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
    at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
    at org.apache.flink.shaded.akka.org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
    at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.extractFrame(FrameDecoder.java:566)
    at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:391)
    at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
    at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.apache.flink.shaded.akka.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.apache.flink.shaded.akka.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
{panel}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)