[jira] [Commented] (FLINK-7540) Akka hostnames are not normalised consistently

Till Rohrmann (JIRA) Wed, 11 Oct 2017 15:50:41 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201096#comment-16201096
 ]


Till Rohrmann commented on FLINK-7540:
--------------------------------------

This is indeed a big problem for setups where you have lower and upper case 
hostnames and if you use IPv6 addresses. The underlying problem is as Aljoscha 
pointed out that we don't apply the hostname normalisation consistently. For 
example, the {{StandaloneHaServices}} assume that hostname are normalized. 
However, this is not true in the Yarn, Mesos and Flip-6 case. For the HA mode 
this is not a problem since we distribute the hostnames via ZooKeeper.

All the affected cases have in common that they start their {{ActorSystem}} via 
the {{BootstrapTools}}. Adding the normalization to 
{{AkkaUtils#getAkkaConfig(Configuration, Option[(String, Int)])}} should solve 
the problem because all remote actor systems get their hostname configuration 
via this method.

> Akka hostnames are not normalised consistently
> ----------------------------------------------
>
>                 Key: FLINK-7540
>                 URL: https://issues.apache.org/jira/browse/FLINK-7540
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.3.1, 1.4.0, 1.3.2
>            Reporter: Tong Yan Ou
>            Priority: Critical
>              Labels: patch
>             Fix For: 1.3.3
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In {{NetUtils.unresolvedHostToNormalizedString()}} we lowercase hostnames, 
> Akka seems to preserve the uppercase/lowercase distinctions when starting the 
> Actor. This leads to problems because other parts (for example 
> {{JobManagerRetriever}}) cannot find the actor leading to a nonfunctional 
> cluster.
> h1. Original Issue Text
> Hostnames in my  hadoop cluster are like these: “DSJ-RTB-4T-177”,” 
> DSJ-signal-900G-71”
> When using the following command:
> ./bin/flink run -m yarn-cluster -yn 1 -yqu xl_trip -yjm 1024 
> ~/flink-1.3.1/examples/batch/WordCount.jar --input 
> /user/all_trip_dev/test/testcount.txt --output /user/all_trip_dev/test/result 
>  
> Or
> ./bin/yarn-session.sh -d -jm 6144  -tm 12288 -qu xl_trip -s 24 -n 5 -nm 
> "flink-YarnSession-jm6144-tm12288-s24-n5-xl_trip"
> There will be some exceptions at Command line interface:
> java.lang.RuntimeException: Unable to get ClusterClient status from 
> Application Client
> at 
> org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
> …
> Caused by: org.apache.flink.util.FlinkException: Could not connect to the 
> leading JobManager. Please check that the JobManager is running.
> h4. Then the job fails , starting the yarn-session is the same.
> The exceptions of the application log:
> 2017-08-10 17:36:10,334 WARN  
> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to 
> retrieve leader gateway and port.
> akka.actor.ActorNotFound: Actor not found for: 
> ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), 
> Path(/user/jobmanager)]
> …
> 2017-08-10 17:36:10,837 ERROR org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Resource manager could not register at JobManager
> akka.pattern.AskTimeoutException: Ask timed out on 
> [ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), 
> Path(/user/jobmanager)]] after [10000 ms]
> And I found some differences in actor System:
> 2017-08-10 17:35:56,791 INFO  org.apache.flink.yarn.YarnJobManager            
>               - Starting JobManager at 
> akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager.
> 2017-08-10 17:35:56,880 INFO  org.apache.flink.yarn.YarnJobManager            
>               - JobManager 
> akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager was granted 
> leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
> 2017-08-10 17:36:00,312 INFO  
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend 
> listening at 0:0:0:0:0:0:0:0:54921
> 2017-08-10 17:36:00,312 INFO  
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with 
> JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager on port 
> 54921
> 2017-08-10 17:36:00,313 INFO  
> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader 
> reachable under 
> akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager:00000000-0000-0000-0000-000000000000.
> The JobManager is  “akka.tcp://flink@DSJ-signal-4T-248:65082” and the 
> JobManagerRetriever is “akka.tcp://flink@dsj-signal-4t-248:65082”
> The hostname of JobManagerRetriever’s actor is lowercase.
> And I read source code,
> Class NetUtils the unresolvedHostToNormalizedString(String host) method of 
> line 127:
>       public static String unresolvedHostToNormalizedString(String host) {    
>         
> // Return loopback interface address if host is null          
> // This represents the behavior of {@code InetAddress.getByName } and RFC 
> 3330                if (host == null) {                     
>    host = InetAddress.getLoopbackAddress().getHostAddress();          
> } else {                      host = host.trim().toLowerCase();               
> }
> ...
> }
> It turns the host name into lowercase.
> Therefore, JobManagerRetriever certainly can not find Jobmanager's 
> actorSYstem.
> Then I removed the call to the toLowerCase() method in the source code.
> Finally ,I can submit a job in yarn-cluster mode and start a yarn-session.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (FLINK-7540) Akka hostnames are not normalised consistently

Reply via email to