Wei-Chiu Chuang commented on HADOOP-15378:

Thank you, Steve. BTW your book is valuable in troubleshooting this issue.

No .... unfortunately current CDH5 doesn't have KDiag (I thought of backporting 
it but I forgot).
We did ask for JDK Kerberos debug and Hadoop debug log. But we corrected the 
invalid krb5.conf before the debug log was put into place.

> Hadoop client unable to relogin because a remote DataNode has an incorrect 
> krb5.conf
> ------------------------------------------------------------------------------------
>                 Key: HADOOP-15378
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15378
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 2.6.0
>         Environment: CDH5.8.3, Kerberized, Impala
>            Reporter: Wei-Chiu Chuang
>            Priority: Critical
> This is a very weird bug.
> We received a report where a Hadoop client (Impala Catalog server) failed to 
> relogin and crashed every several hours. Initial indication suggested the 
> symptom matched HADOOP-13433.
> But after we patched HADOOP-13433 (as well as HADOOP-15143), Impala Catalog 
> server still kept crashing.
> {noformat}
> W0114 05:49:24.676743 41444 UserGroupInformation.java:1838] 
> PriviledgedActionException as:impala/host1.example....@example.com 
> (auth:KERBEROS) 
> cause:org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException):
>  Failure to initialize security context
> W0114 05:49:24.680363 41444 UserGroupInformation.java:1137] The first 
> kerberos ticket is not TGT(the server principal is 
> hdfs/host2.example....@example.com), remove and destroy it.
> W0114 05:49:24.680501 41444 UserGroupInformation.java:1137] The first 
> kerberos ticket is not TGT(the server principal is 
> hdfs/host3.example....@example.com), remove and destroy it.
> W0114 05:49:24.680593 41444 UserGroupInformation.java:1153] Warning, no 
> kerberos ticket found while attempting to renew ticket{noformat}
> The error “Failure to initialize security context” is suspicious here. 
> Catalogd was unable to log in because of a Kerberos issue. The JDK expects 
> the first kerberos ticket of a principal to be a TGT, however it seems that 
> after this error, because it was unable to login successfully, the first 
> ticket was no longer a TGT. The patch HADOOP-13433 removed other tickets of 
> the principal, because it expects the TGT to be in the principal’s ticket, 
> which is untrue in this case. So finally, it removed all tickets.
> And then
> {noformat}
> W0114 05:49:24.681946 41443 UserGroupInformation.java:1838] 
> PriviledgedActionException as:impala/host1.example....@example.com 
> (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)]
> {noformat}
> The error “Failed to find any Kerberos tgt” is typically an indication that 
> the user’s Kerberos ticket has expired. However, that’s definitely not the 
> case here, since it was just a little over 8 hours.
> After we patched HADOOP-13433, the error handling code exhibited NPE, as 
> reported in HADOOP-15143.
> {code:java}
> I0114 05:50:26.758565 6384 RetryInvocationHandler.java:148] Exception while 
> invoking listCachePools of class ClientNamenodeProtocolTranslatorPB over 
> host4.example.com/ after 2 fail over attempts. Trying to fail 
> over immediately. Java exception follows: java.io.IOException: Failed on 
> local exception: java.io.IOException: Couldn't set up IO streams; Host 
> Details : local host is: "host1.example.com/"; destination host 
> is: "host4.example.com":8020; at 
> org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1506) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1439) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>  at com.sun.proxy.$Proxy9.listCachePools(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.listCachePools(ClientNamenodeProtocolTranslatorPB.java:1261)
>  at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>  at com.sun.proxy.$Proxy10.listCachePools(Unknown Source) at 
> org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:55)
>  at 
> org.apache.hadoop.hdfs.protocol.CachePoolIterator.makeRequest(CachePoolIterator.java:33)
>  at 
> org.apache.hadoop.fs.BatchedRemoteIterator.makeRequest(BatchedRemoteIterator.java:77)
>  at 
> org.apache.hadoop.fs.BatchedRemoteIterator.makeRequestIfNeeded(BatchedRemoteIterator.java:85)
>  at 
> org.apache.hadoop.fs.BatchedRemoteIterator.hasNext(BatchedRemoteIterator.java:99)
>  at 
> com.cloudera.impala.catalog.CatalogServiceCatalog$CachePoolReader.run(CatalogServiceCatalog.java:193)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: 
> Couldn't set up IO streams at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:826) at 
> org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:396) at 
> org.apache.hadoop.ipc.Client.getConnection(Client.java:1555) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1478) ... 23 more Caused by: 
> java.lang.NullPointerException at 
> org.apache.hadoop.security.UserGroupInformation.fixKerberosTicketOrder(UserGroupInformation.java:1136)
>  at 
> org.apache.hadoop.security.UserGroupInformation.reloginFromTicketCache(UserGroupInformation.java:1272)
>  at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:697) at 
> java.security.AccessController.doPrivileged(Native Method) at 
> javax.security.auth.Subject.doAs(Subject.java:422) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1835)
>  at 
> org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:681)
>  at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:769) 
> ... 26 more
> {code}
> In any case, HADOOP-15143 does not fix the problem, since valid credentials 
> were already removed from the user.
> Also, even after HADOOP-13433 was applied, I still saw the following error:
> {noformat}
> W0113 14:33:44.254727 1255277 UserGroupInformation.java:1838] 
> PriviledgedActionException as:impala/host1.example....@example.com 
> (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
> [Caused by GSSException: No valid credentials provided (Mechanism level: The 
> ticket isn't for us (35) - BAD TGS SERVER NAME)]
> {noformat}
> We traced the SaslException back to a particular remote DataNode. Finally, it 
> turns out that the DataNode's host has an incorrect krb5.conf, which points 
> to a decommissioned KDC, inconsistent with the rest of the hosts in the 
> cluster. Upon authentication, it seems the error handling for the 
> SaslException has a race condition, and occasionally it removes valid 
> credential from the UGI. When we corrected the krb5.conf on that host, this 
> error disappeared.
> This is not easy to reproduce, but we are told this issue seems to occur 
> after Impala user performs a "refresh" command. Presumably this command 
> forces Impala catalog server to connect to the problematic DataNode and 
> triggered the buggy code.
> I don't have a patch to fix it but I want to raise this issue for discussion.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to