[ 
https://issues.apache.org/jira/browse/FLINK-6376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026726#comment-16026726
 ] 

ASF GitHub Bot commented on FLINK-6376:
---------------------------------------

Github user EronWright commented on the issue:

    https://github.com/apache/flink/pull/3776
  
    @Rucongzhang thanks for the contribution.  I think I understand the problem 
and your solution, which I will recap.  I also found YARN-2704 to be useful 
background.
    
    *Problem*:
    1. YARN log aggregation depends on an HDFS delegation token, which it 
obtains from container token storage, not from the UGI.  In keytab mode, the 
Flink client doesn't upload any delegation tokens, causing log aggregation to 
fail.
    2. The Flink cluster doesn't renew delegation tokens.  Note: Flink does 
renew _Kerberos tickets_ using the keytab.
    3. When the UGI contains both a delegation token and a Kerberos ticket, the 
delegation token is preferred.  After expiration, Flink does not fall back to 
using the ticket.
    
    *Solution*:
    1. Change Flink client to upload delegation tokens.  Addresses problem 1.
    2. Change Flink cluster to filter out the HDFS delegation token from the 
tokens loaded from storage when populating the UGI.  Addresses problem 3.
    3. Change JM to propagate its stored tokens to the TM, rather than the 
tokens from the UGI (which were filtered in (2)).
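
    A minimal sketch of solution step 2 (a simplified model, not the actual 
Hadoop `Credentials` API: tokens are modeled as a kind-to-bytes map, where 
the real code would iterate the loaded credentials and check each token's 
kind; the `YARN_AM_RM_TOKEN` entry is an illustrative stand-in for the other 
tokens that must survive the filter):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model: drop HDFS delegation tokens from the credentials read
// from container token storage before populating the UGI, so the cluster
// authenticates with the keytab's Kerberos ticket instead of a token that
// will eventually expire.
public class CredentialsFilterSketch {
    // Token kind used by HDFS delegation tokens.
    static final String HDFS_DELEGATION_KIND = "HDFS_DELEGATION_TOKEN";

    static Map<String, byte[]> filterHdfsTokens(Map<String, byte[]> tokensByKind) {
        Map<String, byte[]> kept = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : tokensByKind.entrySet()) {
            if (!HDFS_DELEGATION_KIND.equals(e.getKey())) {
                kept.put(e.getKey(), e.getValue()); // keep all non-HDFS tokens
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, byte[]> loaded = new LinkedHashMap<>();
        loaded.put("HDFS_DELEGATION_TOKEN", new byte[0]);
        loaded.put("YARN_AM_RM_TOKEN", new byte[0]);
        System.out.println(filterHdfsTokens(loaded).keySet()); // [YARN_AM_RM_TOKEN]
    }
}
```

    Note that step 3 matters because of this filter: the JM must forward the 
original stored tokens to the TM, since its own UGI no longer holds the HDFS 
delegation token.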


> When deploying a Flink cluster on YARN, the HDFS delegation token is missing.
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-6376
>                 URL: https://issues.apache.org/jira/browse/FLINK-6376
>             Project: Flink
>          Issue Type: Bug
>          Components: Security, YARN
>            Reporter: zhangrucong1982
>            Assignee: zhangrucong1982
>
> 1. I use Flink version 1.2.0 and deploy the Flink cluster on YARN. The 
> Hadoop version is 2.7.2.
> 2. I use Flink in security mode with a keytab and principal. The key 
> configuration is: security.kerberos.login.keytab: /home/ketab/test.keytab 
> and security.kerberos.login.principal: test.
> 3. The YARN configuration is the default, with log aggregation enabled 
> ("yarn.log-aggregation-enable: true").
> 4. When deploying the Flink cluster on YARN, the YARN NodeManager fails as 
> follows while aggregating logs to HDFS. The root cause is the missing HDFS 
> delegation token.
>  java.io.IOException: Failed on local exception: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]; Host Details : local host is: 
> "SZV1000258954/10.162.181.24"; destination host is: "SZV1000258954":25000;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:796)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1515)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1447)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:802)
>         at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:201)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
>         at com.sun.proxy.$Proxy27.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1919)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1500)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1496)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1496)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:271)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:68)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:299)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1769)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:284)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:390)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:342)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:470)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:194)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]
>         at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:722)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1769)
>         at 
> org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:685)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:772)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:394)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1564)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1486)
>         ... 29 more
> Caused by: org.apache.hadoop.security.AccessControlException: Client cannot 
> authenticate via:[TOKEN, KERBEROS]
>         at 
> org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:177)
>         at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:404)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:581)
>         at 
> org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:394)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:764)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:760)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1769)
>         at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:759)
>         ... 32 more
> 5. Hadoop addressed this behavior in HADOOP-14116 
> (https://issues.apache.org/jira/browse/HADOOP-14116): when the HDFS 
> delegation token is missing, the client retries 20 times, sleeping 1 second 
> between attempts. As a result, deploying the Flink cluster on YARN is very 
> slow; it takes about 5 minutes to deploy a cluster with 2 TaskManagers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
