[
https://issues.apache.org/jira/browse/FLINK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann closed FLINK-12106.
---------------------------------
Resolution: Duplicate
Thanks for reporting this issue [~Hutchins]. [~maguowei] is right and this
issue should be fixed with FLINK-10941. Hence, closing this issue as a
duplicate.
> Jobmanager is killing FINISHED taskmanager containers, causing exception in
> still running Taskmanagers an
> --------------------------------------------------------------------------------------------------------
>
> Key: FLINK-12106
> URL: https://issues.apache.org/jira/browse/FLINK-12106
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.7.2
> Environment: Hadoop: hdp/2.5.6.0-40
> Flink: 1.7.2
> Reporter: John
> Priority: Major
>
> When running a single Flink job on YARN, some of the taskmanager containers
> reach the FINISHED state before others. It appears that, after receiving the
> final execution state FINISHED from a taskmanager, the jobmanager waits ~68
> seconds and then frees the associated slot in that taskmanager. After an
> additional 60 seconds, the jobmanager stops the same taskmanager because the
> TaskExecutor exceeded the idle timeout.
> Meanwhile, other taskmanagers are still working to complete the job. Within
> 10 seconds after the taskmanager container above is stopped, the remaining
> taskmanagers receive an exception due to loss of connection to the stopped
> taskmanager. These exceptions result in job failure.
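>
> A possible workaround (not verified here) might be to raise the slot and
> taskmanager idle timeouts in flink-conf.yaml so that finished containers are
> kept around until the whole job completes. The option names below are the
> standard Flink keys; the values are only illustrative:
> slot.idle.timeout: 3600000                    # ms; keep idle slots allocated for up to 1 hour
> resourcemanager.taskmanager-timeout: 3600000  # ms; keep idle taskmanagers registered for up to 1 hour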
>
> Relevant logs:
> 2019-04-03 13:49:00,013 INFO org.apache.flink.yarn.YarnResourceManager
> - Registering TaskManager with ResourceID
> container_1553017480503_0158_01_000038
> (akka.tcp://flink@hadoop4:42745/user/taskmanager_0) at ResourceManager
> 2019-04-03 13:49:05,900 INFO org.apache.flink.yarn.YarnResourceManager
> - Registering TaskManager with ResourceID
> container_1553017480503_0158_01_000059
> (akka.tcp://flink@hadoop9:55042/user/taskmanager_0) at ResourceManager
>
>
> 2019-04-03 13:48:51,132 INFO org.apache.flink.yarn.YarnResourceManager
> - Received new container:
> container_1553017480503_0158_01_000077 - Remaining pending container
> requests: 6
> 2019-04-03 13:48:52,862 INFO org.apache.flink.yarn.YarnTaskExecutorRunner
> -
> -Dlog.file=/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000077/taskmanager.log
> 2019-04-03 13:48:57,490 INFO
> org.apache.flink.runtime.io.network.netty.NettyServer - Successful
> initialization (took 202 ms). Listening on SocketAddress
> /192.168.230.69:40140.
> 2019-04-03 13:49:12,575 INFO org.apache.flink.yarn.YarnResourceManager
> - Registering TaskManager with ResourceID
> container_1553017480503_0158_01_000077
> (akka.tcp://flink@hadoop9:51525/user/taskmanager_0) at ResourceManager
> 2019-04-03 13:49:12,631 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Allocated
> slot for AllocationID{42fed3e5a136240c23cc7b394e3249e9}.
> 2019-04-03 14:58:15,188 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor -
> Un-registering task and sending final execution state FINISHED to JobManager
> for task DataSink
> (com.anovadata.alexflinklib.sinks.bucketing.BucketingOutputFormat@26874f2c)
> a4b5fb32830d4561147b2714828109e2.
> 2019-04-03 14:59:23,049 INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing
> idle slot [AllocationID{42fed3e5a136240c23cc7b394e3249e9}].
> 2019-04-03 14:59:23,058 INFO
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot
> TaskSlot(index:0, state:ACTIVE, resource profile:
> ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647,
> directMemoryInMB=2147483647, nativeMemoryInMB=2147483647,
> networkMemoryInMB=2147483647}, allocationId:
> AllocationID{42fed3e5a136240c23cc7b394e3249e9}, jobId:
> a6c4e367698c15cdf168d19a89faff1d).
> 2019-04-03 15:00:02,641 INFO org.apache.flink.yarn.YarnResourceManager
> - Stopping container container_1553017480503_0158_01_000077.
> 2019-04-03 15:00:02,646 INFO org.apache.flink.yarn.YarnResourceManager
> - Closing TaskExecutor connection
> container_1553017480503_0158_01_000077 because: TaskExecutor exceeded the
> idle timeout.
>
>
> 2019-04-03 13:48:48,902 INFO org.apache.flink.yarn.YarnTaskExecutorRunner
> -
> -Dlog.file=/data1/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000059/taskmanager.log
> 2019-04-03 14:59:24,677 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter - Flushing mem
> columnStore to file. allocated memory: 109479981
> 2019-04-03 15:00:05,696 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter - mem size
> 135014409 > 134217728: flushing 1930100 records to disk.
> 2019-04-03 15:00:05,696 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter - Flushing mem
> columnStore to file. allocated memory: 102677684
> 2019-04-03 15:00:08,671 ERROR org.apache.flink.runtime.operators.BatchTask
> - Error in task code: CHAIN Partition -> FlatMap
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This
> indicates that the remote task manager was lost.
> 2019-04-03 15:00:08,714 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor -
> Un-registering task and sending final execution state FAILED to JobManager
> for task CHAIN Partition -> FlatMap
> 2019-04-03 15:00:08,812 INFO org.apache.flink.runtime.taskmanager.Task
> - Attempting to cancel task DataSink ()
> 2019-04-03 15:00:08,812 INFO org.apache.flink.runtime.taskmanager.Task
> - DataSink () switched from RUNNING to CANCELING.
> 2019-04-03 15:00:08,812 INFO org.apache.flink.runtime.taskmanager.Task
> - Triggering cancellation of task code DataSink ()
>
>
> 2019-04-03 13:48:44,562 INFO org.apache.flink.yarn.YarnTaskExecutorRunner
> -
> -Dlog.file=/data8/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000038/taskmanager.log
> 2019-04-03 14:59:18,620 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter - Flushing mem
> columnStore to file. allocated memory: 0
> 2019-04-03 14:59:48,088 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter - mem size
> 136179972 > 134217728: flushing 1930100 records to disk.
> 2019-04-03 14:59:48,088 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter - Flushing mem
> columnStore to file. allocated memory: 103333893
> 2019-04-03 15:00:08,692 ERROR org.apache.flink.runtime.operators.BatchTask
> - Error in task code: CHAIN Partition -> FlatMap
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This
> indicates that the remote task manager was lost.
> 2019-04-03 15:00:08,741 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor -
> Un-registering task and sending final execution state FAILED to JobManager
> for task CHAIN Partition -> FlatMap
> 2019-04-03 15:00:08,817 INFO org.apache.flink.runtime.taskmanager.Task
> - Attempting to cancel task DataSink ()
> 2019-04-03 15:00:08,817 INFO org.apache.flink.runtime.taskmanager.Task
> - DataSink () switched from RUNNING to CANCELING.
> 2019-04-03 15:00:08,817 INFO org.apache.flink.runtime.taskmanager.Task
> - Triggering cancellation of task code DataSink ()
>
>
> 2019-04-03 15:00:09,196 INFO
> org.apache.flink.runtime.dispatcher.MiniDispatcher - Job
> a6c4e367698c15cdf168d19a89faff1d reached globally terminal state FAILED.
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)