[ 
https://issues.apache.org/jira/browse/FLINK-8829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383785#comment-16383785
 ] 

Stephan Ewen commented on FLINK-8829:
-------------------------------------

One cause we have seen in the past is a conflict between Akka's Netty and the 
Netty instances pulled in by Hadoop (through EMR). In some cases, this caused 
connections to die even though network connectivity was intact.

In Flink 1.4.x, we shade and relocate Akka's Netty to ensure such conflicts 
don't happen any more.

You could try to do the following:
  - Upgrade to Flink's 1.4.x line.
  - Try to remove the Netty that is pulled in via Hadoop. That is not super 
easy: you would need to use a Flink version built against the same Hadoop 
version that EMR runs (Flink should exclude or shade Hadoop's Netty) and 
prevent the Hadoop classpath from being added to the Flink classpath.
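
Before attempting the second option, it may help to confirm that more than one 
unshaded Netty actually ends up on the classpath. The following is only a 
rough diagnostic sketch, not part of Flink: the function names are made up for 
illustration, and the package prefixes (org/jboss/netty for Netty 3.x, 
io/netty for Netty 4.x) are assumptions you may need to adjust. It opens each 
jar on a classpath string and reports prefixes that more than one jar provides.

```python
import zipfile
from pathlib import Path

# Assumed package prefixes: Netty 3.x classes live under org/jboss/netty,
# Netty 4.x classes under io/netty. Adjust these for your setup.
NETTY_PREFIXES = ("org/jboss/netty/", "io/netty/")


def find_netty_jars(classpath, prefixes=NETTY_PREFIXES, sep=":"):
    """Map each package prefix to the jars on the classpath that contain it."""
    hits = {prefix: [] for prefix in prefixes}
    for entry in classpath.split(sep):
        path = Path(entry)
        if path.suffix != ".jar" or not path.is_file():
            continue  # skip directories, wildcard entries, and missing files
        try:
            with zipfile.ZipFile(path) as jar:
                names = jar.namelist()
        except zipfile.BadZipFile:
            continue  # not a readable jar, ignore it
        for prefix in prefixes:
            # Relocated (shaded) classes sit under a different top-level
            # package, so they do not match these prefixes.
            if any(name.startswith(prefix) for name in names):
                hits[prefix].append(str(path))
    return hits


def conflicts(hits):
    """Return only the prefixes that are provided by more than one jar."""
    return {prefix: jars for prefix, jars in hits.items() if len(jars) > 1}
```

Calling conflicts(find_netty_jars(cp)) with cp set to the effective classpath 
of the JobManager or TaskManager process lists the candidate jars. An empty 
result does not prove there is no conflict, since wildcard entries and 
directories are skipped.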

If you go with option one: 1.4.2 is coming out in a few days, and 1.4.1 is 
fine except for a classloading bug when using Kafka with a custom watermark 
generator.

> Flink in EMR(YARN) is down due to Akka communication issue
> ----------------------------------------------------------
>
>                 Key: FLINK-8829
>                 URL: https://issues.apache.org/jira/browse/FLINK-8829
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.2
>            Reporter: Aleksandr Filichkin
>            Priority: Major
>
> Hi,
> We are running a Flink 1.3.2 app in Amazon EMR with YARN. Every week our 
> Flink job goes down due to:
> 2018-02-16 19:00:04,595 WARN akka.remote.ReliableDeliverySupervisor - 
>   Association with remote system 
>   [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177] 
>   has failed, address is now gated for [5000] ms. Reason: [Association failed 
>   with 
>   [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177]] 
>   Caused by: [Connection refused: 
>   ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com/10.97.34.209:42177]
> 2018-02-16 19:00:05,593 WARN akka.remote.RemoteWatcher - Detected 
>   unreachable: 
>   [akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177]
> 2018-02-16 19:00:05,596 INFO 
>   org.apache.flink.runtime.client.JobSubmissionClientActor - Lost connection 
>   to JobManager 
>   akka.tcp://fl...@ip-10-97-34-209.tr-fr-nonprod.aws-int.thomsonreuters.com:42177/user/jobmanager. 
>   Triggering connection timeout.
> Do you have any ideas how to troubleshoot it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
