[ https://issues.apache.org/jira/browse/HDFS-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335359#comment-16335359 ]
Istvan Fajth commented on HDFS-13040:
-------------------------------------

Hi [~jojochuang], let me add a quick comment on the patch as well: it seems that you have added the proxyUser usage to the getCurrentEditLogTxid() method. In DFSInotifyEventInputStream there are two RPC calls, one to getCurrentEditLogTxid() and another to getEditsFromTxid(). I believe the problem can arise with either of them; however, the current call sequence implies that if getCurrentEditLogTxid succeeds, then getEditsFromTxid can succeed as well, unless the ticket expires exactly between the two calls.

[~daryn], sorry for the long comment, but let me add what I know and think is important at this point; I am happy to share more if there are any questions. I have tried to set up a test as well with MiniKDC, MiniJournalCluster, and MiniDFSCluster, but I struggled when I wanted to log in with a new keytab and call doAs with the new UGI, as the UGI also contained traces of the principal that was used by the cluster.
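For context, a minimal, hedged sketch of the proxy-user wrapping being discussed (this is not the actual patch; it needs a Hadoop classpath, and the principal name is purely illustrative). The idea is that the NameNode's own login user carries the Kerberos credentials, while the operation is attributed to the calling client:

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserSketch {
  public static void main(String[] args) throws Exception {
    // The real (credentialed) user, e.g. the NN's own Kerberos login.
    UserGroupInformation loginUser = UserGroupInformation.getLoginUser();

    // Wrap the client's identity around the login user's credentials.
    // "client-principal" is a hypothetical name for illustration only.
    UserGroupInformation proxyUgi =
        UserGroupInformation.createProxyUser("client-principal", loginUser);

    proxyUgi.doAs(new PrivilegedExceptionAction<Void>() {
      @Override
      public Void run() throws Exception {
        // Per the comment above, both inotify RPCs -- the equivalents of
        // getCurrentEditLogTxid() and getEditsFromTxid() -- would need to
        // run under this wrapping, since either can hit the expired ticket.
        return null;
      }
    });
  }
}
```

If only one of the two calls is wrapped, the race the comment describes remains: the first RPC can succeed and the second can still fail once the ticket expires in between.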
What I see: When I do a UserGroupInformation.reset() after cluster startup, this code in the JUnit test:
{code:java}
UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "hdfs", generalHDFSKeytabFile.getAbsolutePath());
LOG.info("global: " + UserGroupInformation.getCurrentUser()
    + ", login as:" + UserGroupInformation.getLoginUser());
LOG.info("ugi: " + ugi.getUserName() + " authenticated via " + ugi.getAuthenticationMethod());
ugi.doAs(new PrivilegedExceptionAction<Void>() {
  @Override
  public Void run() throws Exception {
    LOG.info("Current user in doAs: " + UserGroupInformation.getCurrentUser()
        + " loginuser is:" + UserGroupInformation.getLoginUser());
    fs.mkdirs(new Path("/user/test"));
    fs.listFiles(new Path("/"), true);
    LOG.info("ListFiles success");
    return null;
  }
});
{code}
produces the following log:
{code:java}
2018-01-23 04:08:45,689 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:waitActive(2346)) - Cluster is active
2018-01-23 04:08:45,715 INFO hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:testWithKerberizedCluster(534)) - global: pifta (auth:KERBEROS), login as:pifta (auth:KERBEROS)
2018-01-23 04:08:45,715 INFO hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:testWithKerberizedCluster(535)) - ugi: h...@example.com authenticated via KERBEROS
2018-01-23 04:08:45,715 INFO hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:run(539)) - Current user in doAs: h...@example.com (auth:KERBEROS) loginuser is:pifta (auth:KERBEROS)
2018-01-23 04:08:45,722 WARN security.UserGroupInformation (UserGroupInformation.java:doAs(1923)) - PriviledgedActionException as:pifta (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2018-01-23 04:08:45,723 WARN ipc.Client (Client.java:run(713)) - Exception encountered while connecting to the server :
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2018-01-23 04:08:45,723 WARN security.UserGroupInformation (UserGroupInformation.java:doAs(1923)) - PriviledgedActionException as:pifta (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2018-01-23 04:08:45,728 WARN security.UserGroupInformation (UserGroupInformation.java:doAs(1923)) - PriviledgedActionException as:h...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host.local/192.168.0.122"; destination host is: "localhost":58506;
2018-01-23 04:08:45,728 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1744)) - Shutting down the Mini HDFS Cluster
{code}
So this time I do have a loginUser called pifta that does not have any authentication or tgt; it is the system user I use.
When I don't reset the ugi, the same code produces the following log:
{code:java}
2018-01-23 04:17:49,936 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:waitActive(2346)) - Cluster is active
2018-01-23 04:17:49,953 INFO hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:testWithKerberizedCluster(534)) - global: hdfs/localh...@example.com (auth:KERBEROS), login as:hdfs/localh...@example.com (auth:KERBEROS)
2018-01-23 04:17:49,953 INFO hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:testWithKerberizedCluster(535)) - ugi: h...@example.com authenticated via KERBEROS
2018-01-23 04:17:49,954 ERROR hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:run(539)) - Current user in doAs: h...@example.com (auth:KERBEROS) loginuser is:hdfs/localh...@example.com (auth:KERBEROS)
2018-01-23 04:17:49,983 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(9601)) - allowed=true ugi=hdfs/localh...@example.com (auth:KERBEROS) ip=/127.0.0.1 cmd=mkdirs src=/user/test dst=null perm=hdfs:supergroup:rwxr-xr-x proto=rpc
2018-01-23 04:17:49,992 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(9601)) - allowed=true ugi=hdfs/localh...@example.com (auth:KERBEROS) ip=/127.0.0.1 cmd=listStatus src=/ dst=null perm=null proto=rpc
2018-01-23 04:17:49,997 INFO hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:run(542)) - ListFiles success
2018-01-23 04:17:49,997 INFO hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1744)) - Shutting down the Mini HDFS Cluster
{code}
So if I reset, my loginUser is the system user; when I do not reset, my loginUser is the hdfs user used as the NN principal. Can you point out what I am doing wrong? I am attaching the full test code as well.

For the actual issue, let me try to restate it.
The symptom exactly is:
- At NameNode startup, the NameNode acquires a tgt, and it is saved in its ticket cache in the file defined by the $KRB5CCNAME environment variable, or in /tmp/krb5cc_uid, which is the default if $KRB5CCNAME is not defined.
- When an Inotify client connects to the NameNode after the ticket in the ticket cache has expired, and the Inotify API was not called on this NameNode prior to that (because the first client call started after the expiration, or because of a failover after the expiration), then the call fails with the attached exception on the client side, and a GSSException on the NN side (see later).
- When there is a call to a given NameNode's Inotify API before the ticket expires, then the problem does not occur even after the ticket in the ticket cache expires.

On a successful call, when the KerberosAuthenticator class sets up the connection to the JournalNodes to gather the edits and provide them to the client, it does the authentication with the help of the internal Java facilities, based on the loginUser. This is supported by this log message in the NN log:
{code:java}
DEBUG org.apache.hadoop.security.authentication.client.KerberosAuthenticator: JDK performed authentication on our behalf.
{code}
On a failing call, the KerberosAuthenticator class logs that the JDK authentication did not succeed, and it tries an SPNEGO sequence:
{code:java}
DEBUG org.apache.hadoop.security.authentication.client.KerberosAuthenticator: Performing our own SPNEGO sequence.
{code}
In this case, just as with the successful ones, the current user is the client's principal and the loginUser is the NN principal, and the exception in the NameNode log on the failure is:
{code:java}
ERROR org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: caught exception initializing http://non.local.JN.host:8480/getJournal?jid=nameservice1&segmentTxId=209049&storageInfo=-60%3A1546104427%3A0%3Acluster2
java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:473)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:465)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
	at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:477)
	at org.apache.hadoop.security.SecurityUtil.doAsCurrentUser(SecurityUtil.java:471)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog.getInputStream(EditLogFileInputStream.java:464)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.init(EditLogFileInputStream.java:141)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:192)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:250)
	at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
	at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
	at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
	at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
	at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
	at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
	at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.readOp(NameNodeRpcServer.java:1701)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getEditsFromTxid(NameNodeRpcServer.java:1763)
	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getEditsFromTxid(AuthorizationProviderProxyClientProtocol.java:1011)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getEditsFromTxid(ClientNamenodeProtocolServerSideTranslatorPB.java:1490)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
Caused by: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
	at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:328)
	at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:203)
	at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:215)
	at org.apache.hadoop.hdfs.web.URLConnectionFactory.openConnection(URLConnectionFactory.java:161)
	at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:470)
	... 30 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
	at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
	at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
	at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
	at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
	at org.apache.hadoop.security.authentication.client.KerberosAuthenticator$1.run(KerberosAuthenticator.java:307)
	at org.apache.hadoop.security.authentication.client.KerberosAuthenticator$1.run(KerberosAuthenticator.java:283)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:283)
	... 34 more
{code}
The JournalNode only sees authentication with the NN principal, and it never gets the client's principal in any subject, as we checked.
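The ticket cache lookup described in the first bullet above can be illustrated with a minimal, self-contained sketch (the uid values are hypothetical examples; real MIT Kerberos also supports other cache types, so this mirrors only the default file-based case):

```java
public class CcachePath {
    // Resolve the credential cache file the way described above:
    // honor $KRB5CCNAME when it is set, otherwise fall back to the
    // per-uid default /tmp/krb5cc_<uid>.
    public static String ccachePath(String krb5ccname, long uid) {
        if (krb5ccname != null && !krb5ccname.isEmpty()) {
            return krb5ccname;
        }
        return "/tmp/krb5cc_" + uid;
    }

    public static void main(String[] args) {
        // uid 1000 is a hypothetical example value
        System.out.println(ccachePath(null, 1000));          // /tmp/krb5cc_1000
        System.out.println(ccachePath("/run/krb5cc_0", 0));  // /run/krb5cc_0
    }
}
```

This is why the NN process keeps working for normal edit log tailing after relogin, while a one-off client process that never refreshed the cache file sees "Failed to find any Kerberos tgt" once the cached ticket expires.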
> Kerberized inotify client fails despite kinit properly
> ------------------------------------------------------
>
>                 Key: HDFS-13040
>                 URL: https://issues.apache.org/jira/browse/HDFS-13040
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: Kerberized, HA cluster, iNotify client, CDH5.10.2
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>         Attachments: HDFS-13040.001.patch, TransactionReader.java
>
>
> This issue is similar to HDFS-10799.
> HDFS-10799 turned out to be a client side issue where the client is responsible for renewing its Kerberos ticket actively.
> However, we found that in a slightly different setup, even if the client has valid Kerberos credentials, inotify still fails.
> Suppose the client uses principal h...@example.com,
> namenode 1 uses server principal hdfs/nn1.example....@example.com
> namenode 2 uses server principal hdfs/nn2.example....@example.com
> *After the NameNodes have been up for longer than the Kerberos ticket lifetime*, the client fails with the following error:
> {noformat}
> 18/01/19 11:23:02 WARN security.UserGroupInformation: PriviledgedActionException as:h...@gce.cloudera.com (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(java.io.IOException): We encountered an error reading https://nn2.example.com:8481/getJournal?jid=ns1&segmentTxId=8662&storageInfo=-60%3A353531113%3A0%3Acluster3, https://nn1.example.com:8481/getJournal?jid=ns1&segmentTxId=8662&storageInfo=-60%3A353531113%3A0%3Acluster3. During automatic edit log failover, we noticed that all of the remaining edit log streams are shorter than the current one! The best remaining edit log ends at transaction 8683, but we thought we could read up to transaction 8684. If you continue, metadata will be lost forever!
> 	at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:213)
> 	at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.readOp(NameNodeRpcServer.java:1701)
> 	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getEditsFromTxid(NameNodeRpcServer.java:1763)
> 	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getEditsFromTxid(AuthorizationProviderProxyClientProtocol.java:1011)
> 	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getEditsFromTxid(ClientNamenodeProtocolServerSideTranslatorPB.java:1490)
> 	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:415)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
> {noformat}
> Typically, if the NameNode has an expired Kerberos ticket, the error handling for the usual edit log tailing would let the NameNode relogin with its own Kerberos principal. However, when inotify uses the same code path to retrieve edits, since the current user is the inotify client's principal, unless the client uses the same principal as the NameNode, the NameNode can't do it on behalf of the client.
> Therefore, a more appropriate approach is to use a proxy user so that the NameNode can retrieve edits on behalf of the client.
> I will attach a patch to fix it. This patch has been verified to work on a CDH5.10.2 cluster; however, it seems impossible to craft a unit test for this fix because of the way Hadoop UGI handles Kerberos credentials (I can't have a single process that logs in as two Kerberos principals simultaneously and lets them establish a connection).
> A possible workaround is for the inotify client to use the active NameNode's server principal. However, that's not going to work when there's a NameNode failover, because then the client's principal will not be consistent with the active NN's, and authentication fails.
> Credit: this bug was confirmed and reproduced by [~pifta] and [~r1pp3rj4ck]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)