[jira] [Commented] (FLINK-7540) Akka hostnames are not normalised consistently

ASF GitHub Bot (JIRA) Wed, 11 Oct 2017 16:27:56 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201187#comment-16201187
 ]


ASF GitHub Bot commented on FLINK-7540:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/4812

    [FLINK-7540] Apply consistent hostname normalization

    ## What is the purpose of the change
    
    The hostname normalization is now applied when generating the remote akka config.
    That way it should be ensured that all ActorSystems are bound to a normalized
    hostname.
    
    ## Brief change log
    
    - Add hostname normalization to `AkkaUtils#getAkkaConfig`
    - Replace manual ActorSystem instantiation with `BootstrapTools#startActorSystem`
    
    ## Verifying this change
    
    - Added `AkkaUtilsTest#getAkkaConfig`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its
        components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes) It affects how
        `ActorSystem`s are instantiated.
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)
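
As a rough illustration of the idea described above (normalize once, at the point where the remote config is generated), here is a minimal Java sketch. It is not the code of `AkkaUtils#getAkkaConfig`: the class name, `normalizeHost`, and `remoteConfig` are hypothetical, the config is modelled as a plain map rather than a real Akka config object, and the normalization is reduced to the trim-and-lowercase step quoted in the issue. The only names taken from Akka are the classic remoting keys `akka.remote.netty.tcp.hostname` and `akka.remote.netty.tcp.port`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class NormalizedRemoteConfigSketch {

    // Hypothetical stand-in for the hostname normalization: trim and lowercase only.
    public static String normalizeHost(String host) {
        return host.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical config builder. Because the hostname is normalized here,
    // every caller that binds an actor system through this method ends up
    // with the same spelling of the hostname.
    public static Map<String, String> remoteConfig(String host, int port) {
        Map<String, String> config = new LinkedHashMap<>();
        config.put("akka.remote.netty.tcp.hostname", normalizeHost(host));
        config.put("akka.remote.netty.tcp.port", Integer.toString(port));
        return config;
    }

    public static void main(String[] args) {
        // Mixed-case hostname taken from the logs in the issue description.
        System.out.println(remoteConfig("DSJ-signal-4T-248", 65082));
    }
}
```

Putting the normalization inside the config builder, rather than at each call site, means a caller cannot forget it.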
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixHostnameNormalization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4812.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4812
    
----
commit 00876ead7a4a7492d643f6cba3e784044c54669e
Author: Till Rohrmann <[email protected]>
Date:   2017-10-11T23:17:23Z

    [FLINK-7540] Apply consistent hostname normalization
    
    The hostname normalization is now applied when generating the remote akka config.
    That way it should be ensured that all ActorSystems are bound to a normalized
    hostname.

----


> Akka hostnames are not normalised consistently
> ----------------------------------------------
>
>                 Key: FLINK-7540
>                 URL: https://issues.apache.org/jira/browse/FLINK-7540
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.3.1, 1.4.0, 1.3.2
>            Reporter: Tong Yan Ou
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: patch
>             Fix For: 1.3.3
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In {{NetUtils.unresolvedHostToNormalizedString()}} we lowercase hostnames, but
> Akka seems to preserve the uppercase/lowercase distinction when starting the
> Actor. This leads to problems because other parts (for example
> {{JobManagerRetriever}}) cannot find the actor, which results in a
> nonfunctional cluster.
> h1. Original Issue Text
> Hostnames in my Hadoop cluster look like this: “DSJ-RTB-4T-177”,
> “DSJ-signal-900G-71”.
> When using one of the following commands:
> ./bin/flink run -m yarn-cluster -yn 1 -yqu xl_trip -yjm 1024 
> ~/flink-1.3.1/examples/batch/WordCount.jar --input 
> /user/all_trip_dev/test/testcount.txt --output /user/all_trip_dev/test/result 
>  
> Or
> ./bin/yarn-session.sh -d -jm 6144  -tm 12288 -qu xl_trip -s 24 -n 5 -nm 
> "flink-YarnSession-jm6144-tm12288-s24-n5-xl_trip"
> the command-line interface reports exceptions:
> java.lang.RuntimeException: Unable to get ClusterClient status from 
> Application Client
> at 
> org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:243)
> …
> Caused by: org.apache.flink.util.FlinkException: Could not connect to the 
> leading JobManager. Please check that the JobManager is running.
> h4. The job then fails; starting the yarn-session fails in the same way.
> Exceptions in the application log:
> 2017-08-10 17:36:10,334 WARN  
> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to 
> retrieve leader gateway and port.
> akka.actor.ActorNotFound: Actor not found for: 
> ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), 
> Path(/user/jobmanager)]
> …
> 2017-08-10 17:36:10,837 ERROR org.apache.flink.yarn.YarnFlinkResourceManager  
>               - Resource manager could not register at JobManager
> akka.pattern.AskTimeoutException: Ask timed out on 
> [ActorSelection[Anchor(akka.tcp://flink@dsj-signal-4t-248:65082/), 
> Path(/user/jobmanager)]] after [10000 ms]
> I also found that the actor system addresses differ:
> 2017-08-10 17:35:56,791 INFO  org.apache.flink.yarn.YarnJobManager            
>               - Starting JobManager at 
> akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager.
> 2017-08-10 17:35:56,880 INFO  org.apache.flink.yarn.YarnJobManager            
>               - JobManager 
> akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager was granted 
> leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
> 2017-08-10 17:36:00,312 INFO  
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend 
> listening at 0:0:0:0:0:0:0:0:54921
> 2017-08-10 17:36:00,312 INFO  
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with 
> JobManager akka.tcp://flink@DSJ-signal-4T-248:65082/user/jobmanager on port 
> 54921
> 2017-08-10 17:36:00,313 INFO  
> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader 
> reachable under 
> akka.tcp://flink@dsj-signal-4t-248:65082/user/jobmanager:00000000-0000-0000-0000-000000000000.
> The JobManager is at “akka.tcp://flink@DSJ-signal-4T-248:65082”, while the
> JobManagerRetriever looks it up at “akka.tcp://flink@dsj-signal-4t-248:65082”.
> The hostname used by the JobManagerRetriever is lowercase.
> I then read the source code. In class NetUtils, the
> unresolvedHostToNormalizedString(String host) method at line 127:
>     public static String unresolvedHostToNormalizedString(String host) {
>         // Return loopback interface address if host is null
>         // This represents the behavior of {@code InetAddress.getByName } and RFC 3330
>         if (host == null) {
>             host = InetAddress.getLoopbackAddress().getHostAddress();
>         } else {
>             host = host.trim().toLowerCase();
>         }
>         ...
>     }
> It turns the hostname into lowercase.
> Therefore, the JobManagerRetriever cannot find the JobManager's actor
> system.
> I then removed the call to the toLowerCase() method in the source code.
> After that, I could submit a job in yarn-cluster mode and start a yarn-session.
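
The mismatch described in the issue can be reproduced in isolation with a small Java sketch. It only mirrors the trim-and-lowercase step quoted above and leaves out whatever the elided part of the method does; the class name and the `normalize` and `akkaUrl` helpers are hypothetical and are not part of Flink or Akka. The hostname, port, and actor path are taken from the logs in the issue.

```java
import java.net.InetAddress;
import java.util.Locale;

public class HostnameMismatchSketch {

    // Simplified stand-in for the quoted method: loopback address for null,
    // otherwise trim and lowercase. Locale.ROOT is an addition of this sketch.
    public static String normalize(String host) {
        if (host == null) {
            return InetAddress.getLoopbackAddress().getHostAddress();
        }
        return host.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper that formats an actor address the way the logs show it.
    public static String akkaUrl(String host, int port) {
        return "akka.tcp://flink@" + host + ":" + port + "/user/jobmanager";
    }

    public static void main(String[] args) {
        String configuredHost = "DSJ-signal-4T-248";

        // Address the actor system was started with (hostname kept as configured).
        String boundAddress = akkaUrl(configuredHost, 65082);
        // Address a component derives after normalizing the hostname.
        String lookupAddress = akkaUrl(normalize(configuredHost), 65082);

        System.out.println("bound:  " + boundAddress);
        System.out.println("lookup: " + lookupAddress);
        // The two strings differ only in case, which is enough for an exact
        // string comparison to fail.
        System.out.println("equal:  " + boundAddress.equals(lookupAddress));
    }
}
```

The sketch compares plain strings; that Akka itself treats the two addresses as different is what the `ActorNotFound` log entries in the issue indicate, not something the sketch proves.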



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
