[
https://issues.apache.org/jira/browse/HADOOP-13433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397198#comment-15397198
]
Duo Zhang commented on HADOOP-13433:
------------------------------------
In general, a better way is to not reuse the Subject object. The re-login
stages would be:
1. Create a new subject and log in.
2. Switch to the new subject.
3. Log out the old subject after a delay, since it may still be in use by
someone.
I have written code like this before and it worked, but I do not know whether
it works in Hadoop. If we attach lots of other things to the subject, the
algorithm may break, because we do not know when to move them from the old
subject to the new one.
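The three stages above can be sketched as follows. This is only a minimal
illustration, not code from Hadoop or HBase: the class name
SwappingSubjectHolder, the login and logout callbacks, and the executor used
to delay the logout are all hypothetical, and moving extra state from the old
subject to the new one is deliberately left out.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;
import javax.security.auth.Subject;

/** Hypothetical holder that re-logs in by swapping in a fresh Subject. */
final class SwappingSubjectHolder {
    private final AtomicReference<Subject> current = new AtomicReference<>();
    private final Supplier<Subject> login;         // assumed: returns a new, logged-in Subject
    private final Consumer<Subject> logout;        // assumed: logs the given Subject out
    private final Executor delayedLogoutExecutor;  // assumed: runs tasks after a grace period

    SwappingSubjectHolder(Supplier<Subject> login,
                          Consumer<Subject> logout,
                          Executor delayedLogoutExecutor) {
        this.login = login;
        this.logout = logout;
        this.delayedLogoutExecutor = delayedLogoutExecutor;
    }

    /** Callers read the current Subject once and keep using that reference. */
    Subject get() {
        return current.get();
    }

    void relogin() {
        // 1. Create a new subject and log in; the old one is not touched.
        Subject fresh = login.get();
        // 2. Switch: later callers of get() see the new subject.
        Subject old = current.getAndSet(fresh);
        // 3. Log out the old subject later, since it may still be in use.
        if (old != null) {
            delayedLogoutExecutor.execute(() -> logout.accept(old));
        }
    }
}
```

On Java 9 or later, CompletableFuture.delayedExecutor(delay, unit) is one way
to obtain an executor that delays the logout; how long the grace period needs
to be is an open question here.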
Thanks.
> Race in UGI.reloginFromKeytab
> -----------------------------
>
> Key: HADOOP-13433
> URL: https://issues.apache.org/jira/browse/HADOOP-13433
> Project: Hadoop Common
> Issue Type: Bug
> Components: security
> Reporter: Duo Zhang
>
> This is a problem that has troubled us for several years. For our HBase
> cluster, sometimes the RS will be stuck due to
> {noformat}
> 2016-06-20,03:44:12,936 INFO org.apache.hadoop.ipc.SecureClient: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)]
>     at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:194)
>     at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:140)
>     at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupSaslConnection(SecureClient.java:187)
>     at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.access$700(SecureClient.java:95)
>     at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:325)
>     at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection$2.run(SecureClient.java:322)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1781)
>     at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.hbase.util.Methods.call(Methods.java:37)
>     at org.apache.hadoop.hbase.security.User.call(User.java:607)
>     at org.apache.hadoop.hbase.security.User.access$700(User.java:51)
>     at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:461)
>     at org.apache.hadoop.hbase.ipc.SecureClient$SecureConnection.setupIOstreams(SecureClient.java:321)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1164)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1004)
>     at org.apache.hadoop.hbase.ipc.SecureRpcEngine$Invoker.invoke(SecureRpcEngine.java:107)
>     at $Proxy24.replicateLogEntries(Unknown Source)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:962)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.runLoop(ReplicationSource.java:466)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:515)
> Caused by: GSSException: No valid credentials provided (Mechanism level: The ticket isn't for us (35) - BAD TGS SERVER NAME)
>     at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:663)
>     at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
>     at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:180)
>     at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:175)
>     ... 23 more
> Caused by: KrbException: The ticket isn't for us (35) - BAD TGS SERVER NAME
>     at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:64)
>     at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:185)
>     at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:294)
>     at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:106)
>     at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:557)
>     at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:594)
>     ... 26 more
> Caused by: KrbException: Identifier doesn't match expected value (906)
>     at sun.security.krb5.internal.KDCRep.init(KDCRep.java:133)
>     at sun.security.krb5.internal.TGSRep.init(TGSRep.java:58)
>     at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:53)
>     at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:46)
>     ... 31 more
> {noformat}
> This rarely happens, but when it does, the regionserver gets stuck and can
> never recover.
> Recently we added a log statement after a successful re-login which prints
> the private credentials, and finally caught the direct cause. After a
> successful re-login, we have two Kerberos tickets in the credentials: one is
> the TGT, and the other is a service ticket. The strange thing is that the
> service ticket is placed before the TGT. This breaks an assumption of the
> JDK's Kerberos library. See
> http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5InitCredential.java,
> the {{getTgt}} method:
> {code:title=Krb5InitCredential}
> return AccessController.doPrivileged(
>     new PrivilegedExceptionAction<KerberosTicket>() {
>         public KerberosTicket run() throws Exception {
>             // It's OK to use null as serverPrincipal. TGT is almost
>             // the first ticket for a principal and we use list.
>             return Krb5Util.getTicket(
>                 realCaller,
>                 clientPrincipal, null, acc);
>         }});
> {code}
> So here the library will use the service ticket as the TGT when it tries to
> acquire a service ticket, and the KDC will reject the request since the
> server name of this 'TGT' does not start with 'krbtgt'. And it can never
> recover, because the re-login in UGI first checks whether there is a valid
> TGT and, no doubt, we have one, so the re-login is skipped...
> This usually happens when a secure connection is being set up at the same
> time as the re-login, and the end time of the service ticket indicates that
> it was acquired with the previous TGT. Since UGI does not prevent doAs and
> re-login from happening at the same time, we believe there is a race
> condition.
> After reading the code, we found a possible race condition.
> See
> http://hg.openjdk.java.net/jdk8u/jdk8u60/jdk/file/935758609767/src/share/classes/sun/security/jgss/krb5/Krb5Context.java,
> the {{initSecContext}} method: it gets the TGT first, then checks whether
> there is already a service ticket and, if not, acquires a service ticket
> using the TGT and puts it into the credentials.
> And Krb5LoginModule.logout (the Sun version) removes the Kerberos tickets
> from the credentials first, and then destroys them.
> Here comes the race condition. Let T1 be the secure connection setup thread
> and T2 be the re-login thread.
> T1: get the TGT
> T2: remove all tickets from the credentials
> T1: check for a service ticket, none (since all tickets have been removed)
> T1: acquire a new service ticket using the TGT and put it into the credentials
> T2: destroy all removed tickets
> T2: login, i.e., put a new TGT into the credentials, after the service ticket
> It is hard to write a unit test that reproduces the problem, because the
> racing code is in the JDK, which is not written by us...
> Suggestions are welcome. Thanks.
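
The interleaving described in the quoted issue can be replayed in a single
thread to show why the service ticket ends up ahead of the TGT. This is a
simplified model and not JDK code: the credentials are a plain list of
principal-name strings, the principal names are made up, and the class
ReloginRaceReplay is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-threaded replay of the T1/T2 interleaving. */
final class ReloginRaceReplay {
    // Made-up principal names, used only for illustration.
    static final String OLD_TGT = "krbtgt/EXAMPLE.COM@EXAMPLE.COM (old)";
    static final String NEW_TGT = "krbtgt/EXAMPLE.COM@EXAMPLE.COM (new)";
    static final String SERVICE = "hbase/rs1.example.com@EXAMPLE.COM";

    /** Returns the credential list as it looks after the race. */
    static List<String> replay() {
        List<String> credentials = new ArrayList<>();
        credentials.add(OLD_TGT);

        // T1: get the TGT (it keeps a reference to the old one).
        String tgtHeldByT1 = credentials.get(0);

        // T2: remove all tickets from the credentials.
        List<String> removed = new ArrayList<>(credentials);
        credentials.clear();

        // T1: check for a service ticket; none, since everything was removed.
        if (!credentials.contains(SERVICE)) {
            // T1: acquire a service ticket with the old TGT and store it.
            if (tgtHeldByT1.startsWith("krbtgt/")) {
                credentials.add(SERVICE);
            }
        }

        // T2: destroy the removed tickets; the new service ticket is not among them.
        removed.clear();

        // T2: login, i.e. append a new TGT, which now sits after the service ticket.
        credentials.add(NEW_TGT);
        return credentials;
    }

    public static void main(String[] args) {
        List<String> credentials = replay();
        // A lookup that takes the first ticket now finds the service ticket.
        System.out.println("first ticket: " + credentials.get(0));
        System.out.println("all tickets:  " + credentials);
    }
}
```

In this model, any lookup that assumes the first ticket is the TGT picks up
the service ticket instead, which matches the symptom reported above.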