Aleksandr Filichkin created FLINK-8829:
------------------------------------------
Summary: Flink in EMR(YARN) is down due to Akka communication issue
Key: FLINK-8829
URL: https://issues.apache.org/jira/browse/FLINK-8829
Project: Flink
Issue Type: Bug
Components: YARN
Affects Versions: 1.3.2
Reporter: Aleksandr Filichkin
Hi,
We are running a Flink 1.3.2 application on Amazon EMR. Every week our Flink job
goes down with the following errors:
_2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:42177] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[email protected]:42177]] Caused by: [Connection refused: ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com/10.97.34.209:42177]
2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://[email protected]:42177]
2018-02-16 19:00:05,596 INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection to JobManager akka.tcp://[email protected]:42177/user/jobmanager. Triggering connection timeout._
Do you have any ideas on how to troubleshoot this?
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)