[ 
https://issues.apache.org/jira/browse/FLINK-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandr Filichkin updated FLINK-8829:
---------------------------------------
    Description: 
Hi,

We are running a Flink 1.3.2 application on Amazon EMR with YARN. Every week
our Flink job goes down with the following errors:

2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177]] Caused by: [Connection refused: ip-10-97-34-209.tr-fr-nonprod.aws-int.com/10.97.34.209:42177]
2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177]
2018-02-16 19:00:05,596 INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection to JobManager akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177/user/jobmanager. Triggering connection timeout.

Do you have any ideas how to troubleshoot it?

 

  was:
Hi,

We are running a Flink 1.3.2 application on Amazon EMR with YARN. Every week
our Flink job goes down with the following errors:

2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177]] Caused by: [Connection refused: ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com/10.97.34.209:42177]
2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177]
2018-02-16 19:00:05,596 INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection to JobManager akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177/user/jobmanager. Triggering connection timeout.

Do you have any ideas how to troubleshoot it?

 


> Flink in EMR(YARN) is down due to Akka communication issue
> ----------------------------------------------------------
>
>                 Key: FLINK-8829
>                 URL: https://issues.apache.org/jira/browse/FLINK-8829
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.2
>            Reporter: Aleksandr Filichkin
>            Priority: Major
>
> Hi,
> We are running a Flink 1.3.2 application on Amazon EMR with YARN. Every week
> our Flink job goes down with the following errors:
> 2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177]] Caused by: [Connection refused: ip-10-97-34-209.tr-fr-nonprod.aws-int.com/10.97.34.209:42177]
> 2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177]
> 2018-02-16 19:00:05,596 INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection to JobManager akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.com:42177/user/jobmanager. Triggering connection timeout.
> Do you have any ideas how to troubleshoot it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
