[ 
https://issues.apache.org/jira/browse/FLINK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek updated FLINK-7540:
------------------------------------
    Priority: Blocker  (was: Critical)

> Akka hostnames are not normalised consistently
> ----------------------------------------------
>
>                 Key: FLINK-7540
>                 URL: https://issues.apache.org/jira/browse/FLINK-7540
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.3.1, 1.4.0, 1.3.2
>            Reporter: Tong Yan Ou
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: patch
>             Fix For: 1.4.0, 1.3.3
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In {{NetUtils.unresolvedHostToNormalizedString()}} we lowercase hostnames, but 
> Akka seems to preserve the uppercase/lowercase distinctions when starting the 
> actor. This leads to problems because other parts (for example 
> {{JobManagerRetriever}}) cannot find the actor, leading to a nonfunctional 
> cluster.
> h1. Original Issue Text
> Hostnames in my Hadoop cluster look like this: “DSJ-RTB-4T-177”, 
> “DSJ-signal-900G-71”.
> When using the following command:
> ./bin/flink run -m yarn-cluster -yn 1 -yqu xl_trip -yjm 1024 
> ~/flink-1.3.1/examples/batch/WordCount.jar --input 
> /user/all_trip_dev/test/testcount.txt --output /user/all_trip_dev/test/result 
>  
> Or
> ./bin/yarn-session.sh -d -jm 6144  -tm 12288 -qu xl_trip -s 24 -n 5 -nm 
> "flink-YarnSession-jm6144-tm12288-s24-n5-xl_trip"
> The following exceptions appear on the command-line interface:
> java.lang.RuntimeException: Unable to get ClusterClient status from 
> Application Client
> at 
> org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
> …
> Caused by: org.apache.flink.util.FlinkException: Could not connect to the 
> leading JobManager. Please check that the JobManager is running.
> h4. Then the job fails; starting the yarn-session fails in the same way.
> The exceptions in the application log:
> 2017-08-10 17:36:10,334 WARN  
> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to 
> retrieve leader gateway and port.
> akka.actor.ActorNotFound: Actor not found for: 
> ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), 
> Path(/user/jobmanager)]
> …
> 2017-08-10 17:36:10,837 ERROR org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Resource manager could not register at JobManager
> akka.pattern.AskTimeoutException: Ask timed out on 
> [ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), 
> Path(/user/jobmanager)]] after [10000 ms]
> I also found some differences in the actor system addresses:
> 2017-08-10 17:35:56,791 INFO  org.apache.flink.yarn.YarnJobManager            
>               - Starting JobManager at 
> akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager.
> 2017-08-10 17:35:56,880 INFO  org.apache.flink.yarn.YarnJobManager            
>               - JobManager 
> akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager was granted 
> leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
> 2017-08-10 17:36:00,312 INFO  
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend 
> listening at 0:0:0:0:0:0:0:0:54921
> 2017-08-10 17:36:00,312 INFO  
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with 
> JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager on port 
> 54921
> 2017-08-10 17:36:00,313 INFO  
> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader 
> reachable under 
> akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager:00000000-0000-0000-0000-000000000000.
> The JobManager is “akka.tcp://flink@DSJ-signal-4T-248:65082” and the 
> JobManagerRetriever uses “akka.tcp://flink@dsj-signal-4t-248:65082”.
> The hostname in the JobManagerRetriever’s actor address is lowercase.
> I then read the source code. In class NetUtils, the method 
> unresolvedHostToNormalizedString(String host) at line 127:
>     public static String unresolvedHostToNormalizedString(String host) {
>         // Return loopback interface address if host is null
>         // This represents the behavior of {@code InetAddress.getByName } and RFC 3330
>         if (host == null) {
>             host = InetAddress.getLoopbackAddress().getHostAddress();
>         } else {
>             host = host.trim().toLowerCase();
>         }
>         ...
>     }
> It converts the hostname to lowercase.
> Therefore, the JobManagerRetriever cannot find the JobManager's 
> actor system.
> I then removed the call to the toLowerCase() method in the source code.
> After that, I could submit a job in yarn-cluster mode and start a yarn-session.
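
The mismatch described above can be reproduced without Flink or Akka on the classpath. The following is only a minimal sketch: the class name HostnameCaseDemo and the helper jobManagerAddress are hypothetical, normalizeLowercase mirrors just the two branches of the method quoted in the report (not the full Flink implementation), and plain string equality stands in for the case-sensitive actor lookup that the logs suggest.

```java
import java.net.InetAddress;

public class HostnameCaseDemo {

    // Simplified stand-in for the two branches quoted in the report;
    // this is NOT the full NetUtils.unresolvedHostToNormalizedString().
    public static String normalizeLowercase(String host) {
        if (host == null) {
            return InetAddress.getLoopbackAddress().getHostAddress();
        }
        return host.trim().toLowerCase();
    }

    // Hypothetical helper: builds an address in the shape seen in the logs.
    public static String jobManagerAddress(String host, int port) {
        return "akka.tcp://flink@" + host + ":" + port + "/user/jobmanager";
    }

    public static void main(String[] args) {
        String configuredHost = "DSJ-signal-4T-248"; // hostname from the logs
        int port = 65082;                            // port from the logs

        // Address the JobManager actor is started under (case preserved).
        String started = jobManagerAddress(configuredHost, port);
        // Address a retriever would look up after lowercasing the host.
        String lookedUp = jobManagerAddress(normalizeLowercase(configuredHost), port);

        System.out.println("started:   " + started);
        System.out.println("looked up: " + lookedUp);
        System.out.println("exact match: " + started.equals(lookedUp));
        System.out.println("match ignoring case: " + started.equalsIgnoreCase(lookedUp));
    }
}
```

The two addresses differ only in the case of the hostname, which is consistent with the ActorNotFound and AskTimeoutException entries in the application log.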



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
