[ https://issues.apache.org/jira/browse/FLINK-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aleksandr Filichkin updated FLINK-8829:
---------------------------------------
    Description:
Hi,
We are running a Flink 1.3.2 app on Amazon EMR with YARN. Every week our Flink job goes down due to:

_2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177] has failed, address is now gated for [5000] ms.
Reason: [Association failed with [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177]] Caused by: [Connection refused: ip-10-97-34-209.tr-fr-nonprod.aws-int.com/10.97.34.209:42177]
2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177]
2018-02-16 19:00:05,596 INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection to JobManager akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177/user/jobmanager. Triggering connection timeout._

Do you have any ideas on how to troubleshoot it?

  was:
Hi,
We have running Flink 1.3.2 app in Amazon EMR with YARN. Every week our Flink job is down due to:

_2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177] has failed, address is now gated for [5000] ms.
Reason: [Association failed with [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177]] Caused by: [Connection refused: ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com/10.97.34.209:42177]
2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177]
2018-02-16 19:00:05,596 INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection to JobManager akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177/user/jobmanager. Triggering connection timeout._

Do you have any ideas how to troubleshoot it?


> Flink in EMR(YARN) is down due to Akka communication issue
> ----------------------------------------------------------
>
>                 Key: FLINK-8829
>                 URL: https://issues.apache.org/jira/browse/FLINK-8829
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.2
>            Reporter: Aleksandr Filichkin
>            Priority: Major
>
> Hi,
> We are running a Flink 1.3.2 app on Amazon EMR with YARN. Every week our Flink job goes down due to:
>
> _2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177] has failed, address is now gated for [5000] ms.
> Reason: [Association failed with [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177]] Caused by: [Connection refused: ip-10-97-34-209.tr-fr-nonprod.aws-int.com/10.97.34.209:42177]
> 2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177]
> 2018-02-16 19:00:05,596 INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection to JobManager akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177/user/jobmanager. Triggering connection timeout._
>
> Do you have any ideas on how to troubleshoot it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)