[ 
https://issues.apache.org/jira/browse/HDFS-11590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203855#comment-16203855
 ] 

Daryn Sharp commented on HDFS-11590:
------------------------------------

I skimmed the patch and think it probably looks ok, but the test only proves 
that the renewer attempted to close the files.  I'd like to see a test verify 
that the client was unregistered from the renewer and that the renewer doesn't 
call renew on it again – I haven't yet verified that this happens.  Likewise, 
the test should verify that other clients are not removed and continue to be 
renewed.

I'd prefer the test be more precise by specifically triggering renewals and 
verifying the resulting behavior instead of waiting up to 5s.  Timeouts are 
always problematic on very slow build nodes.
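
A rough sketch of the kind of deterministic test described above.  All the 
types here (FakeClient, RecordingClient, ManualRenewer) are hypothetical 
stand-ins invented for illustration, not the real DFSClient or LeaseRenewer, 
and the "close files and unregister on failure" policy is an assumption about 
the intended behavior rather than what the patch does.  The point is only that 
the test drives each renewal pass itself and asserts on call counts, so nothing 
depends on a timeout.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a client whose lease is renewed.
interface FakeClient {
  void renewLease() throws Exception; // throws when the token is unusable
  void closeFiles();
}

// Records how often the renewer touched it, so a test can assert on counts.
final class RecordingClient implements FakeClient {
  private final boolean tokenExpired;
  int renewCalls = 0;
  int closeCalls = 0;

  RecordingClient(boolean tokenExpired) {
    this.tokenExpired = tokenExpired;
  }

  @Override
  public void renewLease() throws Exception {
    renewCalls++;
    if (tokenExpired) {
      throw new Exception("token is expired");
    }
  }

  @Override
  public void closeFiles() {
    closeCalls++;
  }
}

// Hypothetical renewer with no background thread: the test calls renewOnce()
// directly instead of sleeping and hoping a renewal has happened.
final class ManualRenewer {
  private final Set<FakeClient> clients = new LinkedHashSet<>();

  void register(FakeClient c) {
    clients.add(c);
  }

  boolean isRegistered(FakeClient c) {
    return clients.contains(c);
  }

  // One renewal pass. Assumed policy: a client whose renewal fails has its
  // files closed and is unregistered, so it is never renewed again.
  void renewOnce() {
    List<FakeClient> snapshot = new ArrayList<>(clients);
    for (FakeClient c : snapshot) {
      try {
        c.renewLease();
      } catch (Exception e) {
        c.closeFiles();
        clients.remove(c);
      }
    }
  }
}
```

A test would register one client with an expired token and one healthy client, 
call renewOnce() twice, and then check that the expired client saw exactly one 
renew call and one close call and is no longer registered, while the healthy 
client saw two renew calls and is still registered.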

> Nodemanagers have DDoS our namenode due to HDFS_DELEGATION_TOKEN expired or 
> not in the cache
> --------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11590
>                 URL: https://issues.apache.org/jira/browse/HDFS-11590
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.6.0
>         Environment: Releases:
> cloudera release cdh-5.5.0
> openjdk version "1.8.0_91"
> linux centos6 servers
> Cluster info:
> Namenode and resourcemanager in HA with kerberos authentication
> More than 1300 datanodes/nodemanagers
>            Reporter: Nicolas Fraison
>            Priority: Minor
>         Attachments: HDFS-11590.001.patch, HDFS-11590.002.patch, 
> HDFS-11590.patch
>
>
> We have faced huge slowdowns on our namenode because all our nodemanagers 
> kept retrying a lease renewal and reconnecting to the namenode every 
> second for 1 hour, due to an HDFS_DELEGATION_TOKEN being expired or not 
> in the cache.
> The number of TIME_WAIT connections on our namenode was stuck at the 
> configured maximum of 250k during this period because of the reconnections.
> {code}
> 2017-03-02 11:51:42,817 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for appattempt_1488396860014_156103_000001 
> (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
>   2017-03-02 11:51:43,414 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for appattempt_1488396860014_156120_000001 
> (auth:TOKEN) for protocol=interface 
> org.apache.hadoop.yarn.api.ContainerManagementProtocolPB
>   2017-03-02 11:51:51,994 WARN 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:prediction (auth:SIMPLE) 
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) is expired
>   2017-03-02 11:51:51,995 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) is expired
>   2017-03-02 11:51:51,995 WARN 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:prediction (auth:SIMPLE) 
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) is expired
>   2017-03-02 11:51:51,995 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_1560141256_4187204] for 30 seconds.  
> Will retry shortly ...
>   token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) is expired
>      at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>      at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>      at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>      at com.sun.proxy.$Proxy20.renewLease(Unknown Source)
>      at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
>      at sun.reflect.GeneratedMethodAccessor74.invoke(Unknown Source)
>      at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:498)
>      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
>      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>      at com.sun.proxy.$Proxy21.renewLease(Unknown Source)
>      at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:921)
>      at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
>      at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
>      at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
>      at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
>      at java.lang.Thread.run(Thread.java:745)
>   2017-03-02 12:51:22,032 WARN 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:prediction (auth:SIMPLE) 
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) can't be found 
> in cache
>   2017-03-02 12:51:22,032 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) can't be found 
> in cache
>   2017-03-02 12:51:22,033 WARN 
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException 
> as:prediction (auth:SIMPLE) 
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) can't be found 
> in cache
>   2017-03-02 12:51:22,033 WARN org.apache.hadoop.hdfs.DFSClient: Failed to 
> renew lease for DFSClient_NONMAPREDUCE_1560141256_4187204 for 3600 seconds 
> (>= hard-limit =3600 seconds.) Closing all files being written ...
>   token (HDFS_DELEGATION_TOKEN token 111018676 for prediction) can't be found 
> in cache
>      at org.apache.hadoop.ipc.Client.call(Client.java:1472)
>      at org.apache.hadoop.ipc.Client.call(Client.java:1403)
>      at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>      at com.sun.proxy.$Proxy20.renewLease(Unknown Source)
>      at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:571)
>      at sun.reflect.GeneratedMethodAccessor74.invoke(Unknown Source)
>      at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:498)
>      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
>      at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>      at com.sun.proxy.$Proxy21.renewLease(Unknown Source)
>      at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:921)
>      at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:423)
>      at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:448)
>      at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
>      at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:304)
>      at java.lang.Thread.run(Thread.java:745)
>   2017-03-02 12:51:27,364 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
>  rollingMonitorInterval is set as -1. The log rolling mornitoring interval is 
> disabled. The logs will be aggregated after this application is finished.
> {code}
> The root cause is that the yarn proxy configuration had been removed, which 
> in turn left the resource manager unable to renew the 
> HDFS_DELEGATION_TOKEN.
> Even though the root cause has been identified, I don't think it is normal 
> to retry a lease renewal every second for an hour when the token is expired 
> or not found, because this is not an error the client can recover from.
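
The argument in the description (an expired or missing token cannot be fixed by 
retrying) could be sketched as a small failure classifier.  This is only an 
illustration under assumptions: InvalidTokenException is a hypothetical 
stand-in for the SecretManager$InvalidToken seen in the logs, and 
RenewDecision is an invented helper, not part of Hadoop or of the attached 
patches.

```java
import java.io.IOException;

// Hypothetical stand-in for the InvalidToken error reported in the logs.
final class InvalidTokenException extends IOException {
  InvalidTokenException(String message) {
    super(message);
  }
}

// Invented helper: decides whether a failed lease renewal is worth retrying.
final class RenewDecision {
  enum Action { RETRY, GIVE_UP }

  // Walks the cause chain, since the token error may arrive wrapped in
  // another exception. An invalid token is treated as non-recoverable;
  // any other I/O failure is assumed to be transient.
  static Action onFailure(IOException failure) {
    for (Throwable t = failure; t != null; t = t.getCause()) {
      if (t instanceof InvalidTokenException) {
        return Action.GIVE_UP;
      }
    }
    return Action.RETRY;
  }
}
```

With this shape, a renewer loop would keep retrying on something like a 
connection reset but stop immediately on an expired or unknown token, instead 
of reconnecting every second until the hard limit is reached.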


