[
https://issues.apache.org/jira/browse/FLINK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Roman Khachatryan updated FLINK-17933:
--------------------------------------
Description:
When running a job on a Yarn cluster (load testing), some jobs result in failures.
The initial symptoms are 0 bytes written/transferred in CSV and failures in the logs:
{code:java}
2020-05-17 10:02:32,858 WARN org.apache.flink.runtime.taskmanager.Task [] - Map
-> Flat Map (138/160) (e49f7ea26b633c8035f2a919b1c580c8) switched from RUNNING
to FAILED.{code}
It turned out that all such failures were caused by "Connection reset" from a
single IP, except for one "Leadership lost" error.
The connection resets were likely caused by the TMs receiving SIGTERM
(containers container_1589453804748_0118_01_000004 and
container_1589453804748_0118_01_000005, both on ip-172-31-42-229):
{code:java}
2020-05-17 10:02:31,362 INFO org.apache.flink.yarn.YarnTaskExecutorRunner [] -
RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.{code}
The other TMs received SIGTERM one minute later (though all logs were uploaded at
the same time).
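For context, that log line comes from the signal handler Flink registers at TM
startup. A minimal sketch of the pattern (an illustration using sun.misc.Signal,
not Flink's actual SignalHandler class):
{code:java}
import sun.misc.Signal;

// Sketch: register a JVM handler for SIGTERM (signal 15) so the shutdown
// is logged instead of the process dying silently.
public class SigtermLoggingSketch {
    public static void main(String[] args) throws InterruptedException {
        Signal.handle(new Signal("TERM"), signal ->
                System.out.printf("RECEIVED SIGNAL %d: SIG%s. Shutting down as requested.%n",
                        signal.getNumber(), signal.getName()));
        Thread.sleep(Long.MAX_VALUE); // keep the process alive to observe the signal
    }
}
{code}
Sending kill -15 to such a process produces the same message shape, which is why
the log above points to an external termination (e.g. by the Yarn NodeManager)
rather than a crash inside the TM.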
From the JM it looked like this:
{code:java}
2020-05-17 10:02:23,583 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] -
Trigger heartbeat request.
2020-05-17 10:02:23,587 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] -
Received heartbeat from container_1589453804748_0118_01_000005.
2020-05-17 10:02:23,590 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] -
Received heartbeat from container_1589453804748_0118_01_000006.
2020-05-17 10:02:23,592 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] -
Received heartbeat from container_1589453804748_0118_01_000004.
2020-05-17 10:02:23,595 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] -
Received heartbeat from container_1589453804748_0118_01_000003.
2020-05-17 10:02:23,598 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] -
Received heartbeat from container_1589453804748_0118_01_000002.
2020-05-17 10:02:23,725 DEBUG
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received
acknowledge message for checkpoint 12 from task
459efd2ad8fe2ffe7fffe28530064fe1 of job 5d4d8c88de23b1361fe0dce6ba8443f8 at
container_1589453804748_0118_01_000002 @
ip-172-31-43-69.eu-central-1.compute.internal (dataPort=44625).
2020-05-17 10:02:29,103 DEBUG
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received
acknowledge message for checkpoint 12 from task
266a9326be7e3ec669cce2e6a97ae5b0 of job 5d4d8c88de23b1361fe0dce6ba8443f8 at
container_1589453804748_0118_01_000005 @
ip-172-31-42-229.eu-central-1.compute.internal (dataPort=37329).
2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] -
Association with remote system
[akka.tcp://[email protected]:39999] has
failed, address is now gated for [50] ms. Reason: [Disassociated]
2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] -
Association with remote system
[akka.tcp://[email protected]:42567] has
failed, address is now gated for [50] ms. Reason: [Disassociated]
2020-05-17 10:02:32,900 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Map -> Flat Map
(87/160) (cb77c7002503baa74baf73a3a100c2f2) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException:
readAddress(..) failed: Connection reset by peer (connection to
'ip-172-31-42-229.eu-central-1.compute.internal/172.31.42.229:37329'){code}
There are also JobManager heartbeat timeouts in the logs, but they don't
correlate with the issue.
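For reference, the heartbeat cadence visible above ("Trigger heartbeat request"
roughly every 10 seconds) is governed by heartbeat.interval and
heartbeat.timeout. A small sketch reading them via the Configuration API (my
assumption: stock 1.11 defaults, no overrides in flink-conf.yaml):
{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HeartbeatManagerOptions;

// Sketch: print the heartbeat settings in effect; with an empty Configuration
// these resolve to the defaults (interval 10000 ms, timeout 50000 ms).
public class HeartbeatConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.printf("heartbeat.interval=%dms, heartbeat.timeout=%dms%n",
                conf.getLong(HeartbeatManagerOptions.HEARTBEAT_INTERVAL),
                conf.getLong(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT));
    }
}
{code}
With a 50s timeout, a TM killed at 10:02:31 would only be detected via heartbeat
around a minute later, which is consistent with the "Connection reset" errors
surfacing first.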
> TaskManager was terminated on Yarn - investigate
> ------------------------------------------------
>
> Key: FLINK-17933
> URL: https://issues.apache.org/jira/browse/FLINK-17933
> Project: Flink
> Issue Type: Task
> Components: Deployment / YARN, Runtime / Task
> Affects Versions: 1.11.0
> Reporter: Roman Khachatryan
> Assignee: Roman Khachatryan
> Priority: Major
>