[ https://issues.apache.org/jira/browse/HDFS-13040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16335359#comment-16335359 ]

Istvan Fajth commented on HDFS-13040:
-------------------------------------

Hi [~jojochuang],

let me add a quick comment on the patch as well: it seems that you have added 
the proxy user usage to the getCurrentEditLogTxid() method.
 In DFSInotifyEventInputStream there are two RPC calls: one to 
getCurrentEditLogTxid(), and another to getEditsFromTxid() (sketched below). I 
believe the problem can arise with either of them; however, the current call 
sequence implies that if getCurrentEditLogTxid() succeeds, then 
getEditsFromTxid() can only fail if the ticket expires exactly between the two 
calls.
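For reference, here is a minimal sketch of how those two RPCs are issued, paraphrased for discussion. This is not the exact Hadoop source; only ClientProtocol.getCurrentEditLogTxid(), ClientProtocol.getEditsFromTxid(), and EventBatchList.getBatches() are the real API, the surrounding class is illustrative:
{code:java}
import java.io.IOException;
import org.apache.hadoop.hdfs.inotify.EventBatch;
import org.apache.hadoop.hdfs.inotify.EventBatchList;
import org.apache.hadoop.hdfs.protocol.ClientProtocol;

// Illustrative paraphrase of the DFSInotifyEventInputStream polling flow.
class InotifyPollSketch {
  private final ClientProtocol namenode;
  private long lastReadTxid = -1;

  InotifyPollSketch(ClientProtocol namenode) {
    this.namenode = namenode;
  }

  EventBatch pollOnce() throws IOException {
    if (lastReadTxid == -1) {
      // RPC #1: fetch the most recent transaction id from the NN.
      lastReadTxid = namenode.getCurrentEditLogTxid();
      return null;
    }
    // RPC #2: served by the NN reading edits from the JournalNodes, so this
    // is where an expired NN-side ticket surfaces as a failure.
    EventBatchList batches = namenode.getEditsFromTxid(lastReadTxid + 1);
    return batches.getBatches().isEmpty() ? null : batches.getBatches().get(0);
  }
}
{code}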

[~daryn], sorry for the long comment, but let me add what I know and what I 
think is important at this point; I am happy to share more if there are any 
questions.

I have also tried to set up a test with MiniKDC, MiniJournalCluster, and 
MiniDFSCluster, but I struggled when I wanted to log in with a new keytab and 
call doAs with the new UGI, as the UGI also contained traces of the principal 
that was used by the cluster.
 What I see:
 When I call UserGroupInformation.reset() after cluster startup, this code in 
the JUnit test:
{code:java}
      UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
          "hdfs", generalHDFSKeytabFile.getAbsolutePath());

      LOG.info("global: " + UserGroupInformation.getCurrentUser()
          + ", login as: " + UserGroupInformation.getLoginUser());
      LOG.info("ugi: " + ugi.getUserName() + " authenticated via "
          + ugi.getAuthenticationMethod());
      ugi.doAs(new PrivilegedExceptionAction<Void>() {
        @Override
        public Void run() throws Exception {
          LOG.info("Current user in doAs: " + UserGroupInformation.getCurrentUser()
              + " loginuser is: " + UserGroupInformation.getLoginUser());
          fs.mkdirs(new Path("/user/test"));
          fs.listFiles(new Path("/"), true);
          LOG.info("ListFiles success");
          return null;
        }
      });
{code}
produces the following log:
{code:java}
2018-01-23 04:08:45,689 INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:waitActive(2346)) - Cluster is active
2018-01-23 04:08:45,715 INFO  hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:testWithKerberizedCluster(534)) - global: pifta (auth:KERBEROS), login as:pifta (auth:KERBEROS)
2018-01-23 04:08:45,715 INFO  hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:testWithKerberizedCluster(535)) - ugi: h...@example.com authenticated via KERBEROS
2018-01-23 04:08:45,715 INFO  hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:run(539)) - Current user in doAs: h...@example.com (auth:KERBEROS) loginuser is:pifta (auth:KERBEROS)
2018-01-23 04:08:45,722 WARN  security.UserGroupInformation (UserGroupInformation.java:doAs(1923)) - PriviledgedActionException as:pifta (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2018-01-23 04:08:45,723 WARN  ipc.Client (Client.java:run(713)) - Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2018-01-23 04:08:45,723 WARN  security.UserGroupInformation (UserGroupInformation.java:doAs(1923)) - PriviledgedActionException as:pifta (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2018-01-23 04:08:45,728 WARN  security.UserGroupInformation (UserGroupInformation.java:doAs(1923)) - PriviledgedActionException as:h...@example.com (auth:KERBEROS) cause:java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "host.local/192.168.0.122"; destination host is: "localhost":58506;
2018-01-23 04:08:45,728 INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1744)) - Shutting down the Mini HDFS Cluster
{code}
So this time I do have a loginUser called pifta, which does not have any 
authentication or TGT; it is the system user I use.

When I don't reset the UGI, the same code produces the following log:
{code:java}
2018-01-23 04:17:49,936 INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:waitActive(2346)) - Cluster is active
2018-01-23 04:17:49,953 INFO  hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:testWithKerberizedCluster(534)) - global: hdfs/localh...@example.com (auth:KERBEROS), login as:hdfs/localh...@example.com (auth:KERBEROS)
2018-01-23 04:17:49,953 INFO  hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:testWithKerberizedCluster(535)) - ugi: h...@example.com authenticated via KERBEROS
2018-01-23 04:17:49,954 ERROR hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:run(539)) - Current user in doAs: h...@example.com (auth:KERBEROS) loginuser is:hdfs/localh...@example.com (auth:KERBEROS)
2018-01-23 04:17:49,983 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditMessage(9601)) - allowed=true   ugi=hdfs/localh...@example.com (auth:KERBEROS)  ip=/127.0.0.1   cmd=mkdirs      src=/user/test  dst=null        perm=hdfs:supergroup:rwxr-xr-x  proto=rpc
2018-01-23 04:17:49,992 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditMessage(9601)) - allowed=true   ugi=hdfs/localh...@example.com (auth:KERBEROS)  ip=/127.0.0.1   cmd=listStatus  src=/   dst=null        perm=null       proto=rpc
2018-01-23 04:17:49,997 INFO  hdfs.TestDFSInotifyEventInputStream (TestDFSInotifyEventInputStream.java:run(542)) - ListFiles success
2018-01-23 04:17:49,997 INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1744)) - Shutting down the Mini HDFS Cluster
{code}
So if I reset, my loginUser is the system user; when I do not reset, my 
loginUser is the hdfs user used as the NN principal. Can you point out what I 
am doing wrong? I am attaching the full test code as well.

 

Now let me restate the actual issue. The symptom is exactly this:
 - At NameNode startup, the NameNode acquires a TGT, which is saved in its 
ticket cache: the file defined by the $KRB5CCNAME environment variable, or 
/tmp/krb5cc_<uid>, which is the default when $KRB5CCNAME is not defined (a 
small path-resolution sketch follows this list).
 - When an inotify client connects to the NameNode after the ticket in the 
ticket cache has expired, and the inotify API has not been called on this 
NameNode before that (because the first client call arrived after the 
expiration, or because of a failover after the expiration), the call fails 
with the attached exception on the client side and a GSSException on the NN 
side (see below).
 - When there is a call to a given NameNode's inotify API before the ticket 
expires, the problem does not occur, even after the ticket in the ticket 
cache expires.
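As a small aside, the cache-location resolution described in the first point can be illustrated with this sketch (com.sun.security.auth.module.UnixSystem is a JDK-internal class, used here purely for illustration):
{code:java}
// Sketch: $KRB5CCNAME wins, otherwise /tmp/krb5cc_<uid> is the default.
public class TicketCachePath {
  public static void main(String[] args) {
    String ccache = System.getenv("KRB5CCNAME");
    if (ccache == null) {
      // JDK-internal API, for illustration only.
      long uid = new com.sun.security.auth.module.UnixSystem().getUid();
      ccache = "/tmp/krb5cc_" + uid;
    }
    System.out.println("Ticket cache in use: " + ccache);
  }
}
{code}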

 

On a successful call, when the KerberosAuthenticator class sets up the 
connection to the JournalNodes to gather the edits and provide them to the 
client, it performs the authentication through the JDK's internal facilities, 
based on the loginUser.
 This is supported by the following message in the NN log:
{code:java}
DEBUG org.apache.hadoop.security.authentication.client.KerberosAuthenticator: 
JDK performed authentication on our behalf.
{code}
On a failing call, the KerberosAuthenticator class logs that the JDK 
authentication did not succeed and tries its own SPNEGO sequence:
{code:java}
DEBUG org.apache.hadoop.security.authentication.client.KerberosAuthenticator: 
Performing our own SPNEGO sequence.
{code}
In the failing case, just as in the successful one, the current user is the 
client's principal and the loginUser is the NN principal; the exception in 
the NameNode log on failure is:
{code:java}
ERROR org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: caught exception initializing http://non.local.JN.host:8480/getJournal?jid=nameservice1&segmentTxId=209049&storageInfo=-60%3A1546104427%3A0%3Acluster2
java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:473)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:465)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:477)
    at org.apache.hadoop.security.SecurityUtil.doAsCurrentUser(SecurityUtil.java:471)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog.getInputStream(EditLogFileInputStream.java:464)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.init(EditLogFileInputStream.java:141)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:192)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:250)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
    at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
    at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
    at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.readOp(NameNodeRpcServer.java:1701)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getEditsFromTxid(NameNodeRpcServer.java:1763)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getEditsFromTxid(AuthorizationProviderProxyClientProtocol.java:1011)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getEditsFromTxid(ClientNamenodeProtocolServerSideTranslatorPB.java:1490)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
Caused by: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
    at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:328)
    at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:203)
    at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:215)
    at org.apache.hadoop.hdfs.web.URLConnectionFactory.openConnection(URLConnectionFactory.java:161)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:470)
    ... 30 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
    at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
    at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
    at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
    at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
    at org.apache.hadoop.security.authentication.client.KerberosAuthenticator$1.run(KerberosAuthenticator.java:307)
    at org.apache.hadoop.security.authentication.client.KerberosAuthenticator$1.run(KerberosAuthenticator.java:283)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.doSpnegoSequence(KerberosAuthenticator.java:283)
    ... 34 more
{code}
The JournalNode only sees authentication with the NN principal; as far as we 
have checked, it never gets the client's principal in any Subject.
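To make the proxy-user idea concrete, here is a rough sketch of what the NN-side wrapping could look like. This is an assumption for discussion, not the committed patch, and readEditsFromJournals() is a hypothetical stand-in for the real edit-reading code:
{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.hdfs.inotify.EventBatchList;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch: run the JournalNode fetch as a proxy user whose real user is the
// NN's loginUser, so the HTTP authentication to the JNs uses the NN's
// credentials while the action stays attributed to the remote caller.
class ProxyUserFetchSketch {
  EventBatchList getEditsAsProxy(long txid) throws Exception {
    UserGroupInformation caller = UserGroupInformation.getCurrentUser();
    UserGroupInformation proxy = UserGroupInformation.createProxyUser(
        caller.getShortUserName(), UserGroupInformation.getLoginUser());
    return proxy.doAs((PrivilegedExceptionAction<EventBatchList>) () ->
        readEditsFromJournals(txid));
  }

  // Hypothetical placeholder for the NN code that actually reads edits
  // from the JournalNodes (EditLogInputStream and friends).
  EventBatchList readEditsFromJournals(long txid) {
    throw new UnsupportedOperationException("illustrative placeholder");
  }
}
{code}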

> Kerberized inotify client fails despite kinit properly
> ------------------------------------------------------
>
>                 Key: HDFS-13040
>                 URL: https://issues.apache.org/jira/browse/HDFS-13040
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>         Environment: Kerberized, HA cluster, iNotify client, CDH5.10.2
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Major
>         Attachments: HDFS-13040.001.patch, TransactionReader.java
>
>
> This issue is similar to HDFS-10799.
> HDFS-10799 turned out to be a client-side issue where the client is 
> responsible for actively renewing its Kerberos ticket.
> However, we found that in a slightly different setup, even if the client has 
> valid Kerberos credentials, inotify still fails.
> Suppose the client uses principal h...@example.com, 
>  namenode 1 uses server principal hdfs/nn1.example....@example.com, and
>  namenode 2 uses server principal hdfs/nn2.example....@example.com.
> *After the NameNodes have been running for longer than the Kerberos ticket 
> lifetime*, the client fails with the following error:
> {noformat}
> 18/01/19 11:23:02 WARN security.UserGroupInformation: PriviledgedActionException as:h...@gce.cloudera.com (auth:KERBEROS) cause:org.apache.hadoop.ipc.RemoteException(java.io.IOException): We encountered an error reading https://nn2.example.com:8481/getJournal?jid=ns1&segmentTxId=8662&storageInfo=-60%3A353531113%3A0%3Acluster3, https://nn1.example.com:8481/getJournal?jid=ns1&segmentTxId=8662&storageInfo=-60%3A353531113%3A0%3Acluster3.  During automatic edit log failover, we noticed that all of the remaining edit log streams are shorter than the current one!  The best remaining edit log ends at transaction 8683, but we thought we could read up to transaction 8684.  If you continue, metadata will be lost forever!
>         at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:213)
>         at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.readOp(NameNodeRpcServer.java:1701)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getEditsFromTxid(NameNodeRpcServer.java:1763)
>         at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getEditsFromTxid(AuthorizationProviderProxyClientProtocol.java:1011)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getEditsFromTxid(ClientNamenodeProtocolServerSideTranslatorPB.java:1490)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
> {noformat}
> Typically, if the NameNode has an expired Kerberos ticket, the error 
> handling for regular edit log tailing lets the NameNode relogin with its own 
> Kerberos principal. However, when inotify uses the same code path to 
> retrieve edits, the current user is the inotify client's principal, so 
> unless the client uses the same principal as the NameNode, the NameNode 
> can't relogin on behalf of the client.
> Therefore, a more appropriate approach is to use a proxy user so that the 
> NameNode can retrieve edits on behalf of the client.
> I will attach a patch to fix it. This patch has been verified to work on a 
> CDH5.10.2 cluster; however, it seems impossible to craft a unit test for 
> this fix because of the way Hadoop UGI handles Kerberos credentials (I can't 
> have a single process that logs in as two Kerberos principals simultaneously 
> and lets them establish a connection).
> A possible workaround is for the inotify client to use the active NameNode's 
> server principal. However, that's not going to work when there's a NameNode 
> failover, because then the client's principal will no longer be consistent 
> with the active NN's, and authentication fails.
> Credit: this bug was confirmed and reproduced by [~pifta] and [~r1pp3rj4ck]


