[ https://issues.apache.org/jira/browse/FLINK-10850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann closed FLINK-10850.
---------------------------------
    Resolution: Not A Problem

This should no longer be a problem in Flink 1.10. Please reopen the issue if you 
encounter this problem again.

> Job may hang on FAILING state if taskmanager updateTaskExecutionState failed
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-10850
>                 URL: https://issues.apache.org/jira/browse/FLINK-10850
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5
>            Reporter: ouyangzhe
>            Priority: Major
>
> I encountered a job that ran out of memory (OOM) but hung in the FAILING state. It left 3 slots 
> to release, and the corresponding task state is CANCELING.
> I found the following log in the taskmanager. It seems that the taskmanager tried 
> to updateTaskExecutionState from CANCELING to CANCELED, but hit an OOM.
> {noformat}
> 2018-11-08 18:01:23,250 INFO  org.apache.flink.runtime.taskmanager.Task                     - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from RUNNING to CANCELING.
> 2018-11-08 18:01:23,257 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
> 2018-11-08 18:01:44,081 INFO  org.apache.flink.runtime.taskmanager.Task                     - PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from CANCELING to CANCELED.
> 2018-11-08 18:01:44,081 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8).
> 2018-11-08 18:02:03,097 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
>  
> org.apache.flink.shaded.guava18.com.google.common.collect.Maps$EntryFunction$1.apply(Maps.java:86)
> org.apache.flink.shaded.guava18.com.google.common.collect.Iterators$8.transform(Iterators.java:799)
> org.apache.flink.shaded.guava18.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
> java.util.AbstractCollection.toArray(AbstractCollection.java:141)
> org.apache.flink.shaded.guava18.com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:258)
> org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartitionsProducedBy(ResultPartitionManager.java:100)
> org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:275)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:833)
> java.lang.Thread.run(Thread.java:745)
> 2018-11-08 18:02:05,665 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Discarding the results produced by task execution e9141e20871e530dee904ddce11adca0.
> 2018-11-08 18:02:22,536 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Discarding the results produced by task execution 7fac76a5d76247d803e1f1c47a6b385f.
> 2018-11-08 18:03:47,210 WARN  org.apache.flink.runtime.taskmanager.Task                     - Task 'PartialSolution (BulkIteration (Bulk Iteration)) (97/600)' did not react to cancelling signal for 30 seconds, but is stuck in method:
>  
> org.apache.flink.runtime.memory.MemoryManager.releaseAll(MemoryManager.java:497)
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:837)
> java.lang.Thread.run(Thread.java:745)
> 2018-11-08 18:03:47,213 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task PartialSolution (BulkIteration (Bulk Iteration)) (97/600) (46005ba837efc4ebf783fc92121e55a8) [CANCELED]
> 2018-11-08 18:03:47,215 WARN  org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline  - An exception was thrown by a user handler while handling an exception event ([id: 0x397132f7, /11.10.199.197:33286 => /11.9.137.228:40859] EXCEPTION: java.lang.OutOfMemoryError: GC overhead limit exceeded)
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBuffer.<init>(HeapChannelBuffer.java:42)
>         at org.apache.flink.shaded.akka.org.jboss.netty.buffer.BigEndianHeapChannelBuffer.<init>(BigEndianHeapChannelBuffer.java:34)
>         at org.apache.flink.shaded.akka.org.jboss.netty.buffer.ChannelBuffers.buffer(ChannelBuffers.java:134)
>         at org.apache.flink.shaded.akka.org.jboss.netty.buffer.HeapChannelBufferFactory.getBuffer(HeapChannelBufferFactory.java:68)
>         at org.apache.flink.shaded.akka.org.jboss.netty.buffer.AbstractChannelBufferFactory.getBuffer(AbstractChannelBufferFactory.java:48)
>         at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.extractFrame(FrameDecoder.java:566)
>         at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:391)
>         at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
>         at org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>         at org.apache.flink.shaded.akka.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>         at org.apache.flink.shaded.akka.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>         at org.apache.flink.shaded.akka.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
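
When checking a TaskManager log for this symptom, it can help to extract the last state each execution attempt reached locally and compare it with what the JobManager reports. The following is only a rough sketch, not part of Flink: the names `TRANSITION`, `last_states` and `stuck_in` are hypothetical, and it assumes the log has one event per line (mail-style hard wrapping already undone) and that transitions are logged as "(<32-hex attempt id>) switched from X to Y.", as in the excerpt above.

```python
import re

# Matches Task log lines such as
# "... (97/600) (46005ba837efc4ebf783fc92121e55a8) switched from RUNNING to CANCELING."
# The pattern is an assumption derived from the log excerpt, not a Flink-defined format.
TRANSITION = re.compile(
    r"\((?P<attempt>[0-9a-f]{32})\) switched from (?P<old>[A-Z]+) to (?P<new>[A-Z]+)\."
)


def last_states(lines):
    """Return {execution_attempt_id: last state logged} for the given log lines."""
    states = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            # Later lines overwrite earlier ones, so the final value is the last state.
            states[match.group("attempt")] = match.group("new")
    return states


def stuck_in(lines, state="CANCELING"):
    """Return the sorted attempt ids whose last logged state equals `state`."""
    return sorted(a for a, s in last_states(lines).items() if s == state)
```

Applied to the excerpt above, `last_states` would report attempt 46005ba837efc4ebf783fc92121e55a8 as CANCELED on the TaskManager side, which is consistent with the report that the task reached CANCELED locally while the job still showed it as CANCELING.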



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
