[
https://issues.apache.org/jira/browse/FLINK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201096#comment-16201096
]
Till Rohrmann commented on FLINK-7540:
--------------------------------------
This is indeed a big problem for setups with mixed-case hostnames or IPv6
addresses. The underlying problem, as Aljoscha pointed out, is that we don't
apply the hostname normalisation consistently. For example, the
{{StandaloneHaServices}} assume that hostnames are normalized.
However, this is not true in the Yarn, Mesos and Flip-6 case. For the HA mode
this is not a problem since we distribute the hostnames via ZooKeeper.
All the affected cases have in common that they start their {{ActorSystem}} via
the {{BootstrapTools}}. Adding the normalization to
{{AkkaUtils#getAkkaConfig(Configuration, Option[(String, Int)])}} should solve
the problem because all remote actor systems get their hostname configuration
via this method.
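
To illustrate the mismatch and the proposed fix, here is a minimal, self-contained sketch. It does not use Flink or Akka; {{normalizeHost}} and {{actorSystemAddress}} are hypothetical helpers, and the normalization only mirrors the trim and lowercase step quoted in the issue (IPv6 formatting is left out). The point is that lookups succeed only if the address the actor system binds to is built from the same normalized hostname as the address used for lookups.

```java
import java.util.Locale;

public class HostnameNormalizationSketch {

    // Hypothetical stand-in for the normalization step. It covers only the
    // trim + lowercase behaviour quoted in the issue and ignores IPv6.
    static String normalizeHost(String host) {
        if (host == null) {
            throw new IllegalArgumentException("host must not be null");
        }
        return host.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper that builds an Akka-style address string.
    static String actorSystemAddress(String host, int port) {
        return "akka.tcp://flink@" + host + ":" + port;
    }

    public static void main(String[] args) {
        String configuredHost = "DSJ-signal-4T-248";
        int port = 65082;

        // The lookup side normalizes the hostname before building the address.
        String lookupAddress = actorSystemAddress(normalizeHost(configuredHost), port);

        // Without normalization on the bind side, the two addresses differ.
        String rawBindAddress = actorSystemAddress(configuredHost, port);
        System.out.println(rawBindAddress.equals(lookupAddress)); // false

        // With normalization applied where the bind address is configured,
        // both sides agree.
        String normalizedBindAddress = actorSystemAddress(normalizeHost(configuredHost), port);
        System.out.println(normalizedBindAddress.equals(lookupAddress)); // true
    }
}
```

Applying the normalization in the single place where the remote hostname is configured means individual callers (Yarn, Mesos, Flip-6) do not each have to remember to do it.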
> Akka hostnames are not normalised consistently
> ----------------------------------------------
>
> Key: FLINK-7540
> URL: https://issues.apache.org/jira/browse/FLINK-7540
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination, YARN
> Affects Versions: 1.3.1, 1.4.0, 1.3.2
> Reporter: Tong Yan Ou
> Priority: Critical
> Labels: patch
> Fix For: 1.3.3
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> In {{NetUtils.unresolvedHostToNormalizedString()}} we lowercase hostnames,
> but Akka seems to preserve the uppercase/lowercase distinction when starting
> the actor. This leads to problems because other parts (for example
> {{JobManagerRetriever}}) cannot find the actor, leading to a nonfunctional
> cluster.
> h1. Original Issue Text
> Hostnames in my Hadoop cluster are like these: “DSJ-RTB-4T-177”,
> “DSJ-signal-900G-71”
> When using the following command:
> ./bin/flink run -m yarn-cluster -yn 1 -yqu xl_trip -yjm 1024
> ~/flink-1.3.1/examples/batch/WordCount.jar --input
> /user/all_trip_dev/test/testcount.txt --output /user/all_trip_dev/test/result
>
> Or
> ./bin/yarn-session.sh -d -jm 6144 -tm 12288 -qu xl_trip -s 24 -n 5 -nm
> "flink-YarnSession-jm6144-tm12288-s24-n5-xl_trip"
> The command line interface then reports exceptions:
> java.lang.RuntimeException: Unable to get ClusterClient status from
> Application Client
> at
> org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
> …
> Caused by: org.apache.flink.util.FlinkException: Could not connect to the
> leading JobManager. Please check that the JobManager is running.
> h4. Then the job fails; starting the yarn-session fails the same way.
> The exceptions in the application log:
> 2017-08-10 17:36:10,334 WARN
> org.apache.flink.runtime.webmonitor.JobManagerRetriever - Failed to
> retrieve leader gateway and port.
> akka.actor.ActorNotFound: Actor not found for:
> ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/),
> Path(/user/jobmanager)]
> …
> 2017-08-10 17:36:10,837 ERROR org.apache.flink.yarn.YarnFlinkResourceManager
> - Resource manager could not register at JobManager
> akka.pattern.AskTimeoutException: Ask timed out on
> [ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/),
> Path(/user/jobmanager)]] after [10000 ms]
> I also found differences in the actor system addresses:
> 2017-08-10 17:35:56,791 INFO org.apache.flink.yarn.YarnJobManager
> - Starting JobManager at
> akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager.
> 2017-08-10 17:35:56,880 INFO org.apache.flink.yarn.YarnJobManager
> - JobManager
> akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager was granted
> leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
> 2017-08-10 17:36:00,312 INFO
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend
> listening at 0:0:0:0:0:0:0:0:54921
> 2017-08-10 17:36:00,312 INFO
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with
> JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager on port
> 54921
> 2017-08-10 17:36:00,313 INFO
> org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader
> reachable under
> akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager:00000000-0000-0000-0000-000000000000.
> The JobManager address is “akka.tcp://flink@DSJ-signal-4T-248:65082”, while
> the JobManagerRetriever uses “akka.tcp://flink@dsj-signal-4t-248:65082”.
> The hostname in the JobManagerRetriever’s address is lowercase.
> I then read the source code. In class NetUtils, the
> unresolvedHostToNormalizedString(String host) method at line 127 reads:
> public static String unresolvedHostToNormalizedString(String host) {
>     // Return loopback interface address if host is null
>     // This represents the behavior of {@code InetAddress.getByName } and RFC 3330
>     if (host == null) {
>         host = InetAddress.getLoopbackAddress().getHostAddress();
>     } else {
>         host = host.trim().toLowerCase();
>     }
>     ...
> }
> It turns the host name into lowercase.
> Therefore, JobManagerRetriever cannot find the JobManager's
> ActorSystem.
> I then removed the call to the toLowerCase() method in the source code.
> After that, I could submit a job in yarn-cluster mode and start a yarn-session.