[ 
https://issues.apache.org/jira/browse/HDFS-5322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13791674#comment-13791674
 ] 

Jing Zhao commented on HDFS-5322:
---------------------------------

bq. Again, the basic question driving this change is why 
FSNamesystem#checkOperation(OperationCategory.WRITE) is not throwing during a 
transition to active?

During the transition (Standby -> Active), the current code first sets the 
state of the NN to Active, then starts the active services, during which the NN 
still needs to tail the remaining editlog. If a delegation token is contained 
in that last part of the editlog, 1) 
FSNamesystem#checkOperation(OperationCategory.WRITE) will not throw anything 
since the NN's state has already been changed to Active, and 2) the new ANN 
cannot find the token in its cache since it has not finished applying the 
editlog. We should allow clients to retry, since the delegation token will be 
recognized once the NN finishes reading the editlog.
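
To make the ordering concrete, here is a toy model of the window described 
above. It is only a sketch and not the actual HDFS code: the class 
TransitionSketch, its methods, and RetriableTokenException are all 
hypothetical names made up for illustration. It assumes the state flips to 
Active first and the remaining edits are applied afterwards, and that a lookup 
failing inside that window should be reported as retriable rather than as an 
invalid token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the Standby -> Active ordering; hypothetical names, not HDFS code. */
final class TransitionSketch {
    enum State { STANDBY, ACTIVE }

    /** Hypothetical marker for failures the client should simply retry. */
    static final class RetriableTokenException extends RuntimeException {
        RetriableTokenException(String message) { super(message); }
    }

    private State state = State.STANDBY;
    private boolean tailingRemainingEdits = false;
    private final Set<Integer> tokenCache = new HashSet<>();
    private final List<Integer> untailedTokenIds; // token ids still only in the edit log

    TransitionSketch(List<Integer> untailedTokenIds) {
        this.untailedTokenIds = new ArrayList<>(untailedTokenIds);
    }

    /** Step 1: the state flips to ACTIVE before the edit log is caught up. */
    void setStateActive() {
        state = State.ACTIVE;
        tailingRemainingEdits = true;
    }

    /** Step 2: starting the active services applies the rest of the edit log. */
    void finishTailingEdits() {
        tokenCache.addAll(untailedTokenIds);
        untailedTokenIds.clear();
        tailingRemainingEdits = false;
    }

    /** Plays the role of checkOperation(WRITE): rejects only while in standby. */
    void checkOperation() {
        if (state != State.ACTIVE) {
            throw new IllegalStateException("operation not supported in state standby");
        }
    }

    /** Token lookup that separates "not loaded yet" from "really unknown". */
    void verifyToken(int tokenId) {
        checkOperation();
        if (tokenCache.contains(tokenId)) {
            return;
        }
        if (tailingRemainingEdits) {
            throw new RetriableTokenException("token " + tokenId + " not loaded yet, retry");
        }
        throw new SecurityException("token " + tokenId + " can't be found in cache");
    }

    public static void main(String[] args) {
        TransitionSketch nn = new TransitionSketch(Arrays.asList(11));
        nn.setStateActive();
        try {
            nn.verifyToken(11);
        } catch (RetriableTokenException e) {
            System.out.println("first attempt: " + e.getMessage());
        }
        nn.finishTailingEdits();
        nn.verifyToken(11); // the retry passes once the edit log is applied
        System.out.println("retry succeeded");
    }
}
```

In this model the first lookup of token 11 passes checkOperation() but misses 
the cache, which is the window where a retry is the right client behavior. A 
token that is still missing after the edits are applied is treated as 
genuinely unknown.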

Alternatively, if we let the NN first start the active services and only then 
change its state to Active, your original hack in HADOOP-9880 can work, since a 
StandbyException will be thrown. But this change will 1) extend the failover 
time, and 2) trigger unnecessary client failover. I'm also not sure whether 
this will break other code.

> HDFS delegation token not found in cache errors seen on secure HA clusters
> --------------------------------------------------------------------------
>
>                 Key: HDFS-5322
>                 URL: https://issues.apache.org/jira/browse/HDFS-5322
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.1.1-beta
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>         Attachments: HDFS-5322.000.patch, HDFS-5322.000.patch, 
> HDFS-5322.001.patch, HDFS-5322.002.patch, HDFS-5322.003.patch, 
> HDFS-5322.004.patch
>
>
> While running HA tests we have seen issues where "HDFS delegation token not 
> found in cache" errors cause running jobs to fail.
> {code}
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
> |2013-10-06 20:14:51,193 INFO  [main] mapreduce.Job: Task Id : 
> attempt_1381090351344_0001_m_000007_0, Status : FAILED
> Error: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 11 for hrt_qa) can't be found in cache
> at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> at org.apache.hadoop.ipc.Client.call(Client.java:1300)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)
