Hari Sekhon created AMBARI-10518:
------------------------------------
Summary: Ambari 2.0 stack upgrade HDP 2.2.0.0 => 2.2.4.0 breaks on
safe mode check because the hdfs user's Kerberos cache was not kinit'd properly
Key: AMBARI-10518
URL: https://issues.apache.org/jira/browse/AMBARI-10518
Project: Ambari
Issue Type: Bug
Components: ambari-server, stacks
Affects Versions: 2.0.0
Environment: HDP 2.2.0.0 => 2.2.4.0
Reporter: Hari Sekhon
After deploying the new HDP 2.2.4.0 stack to all nodes successfully in Ambari
2.0, the "perform upgrade" procedure fails on the first step:
{code}Fail: 2015-04-16 11:36:32,623 - Performing a(n) upgrade of HDFS
2015-04-16 11:36:32,624 - u"Execute['/usr/bin/kinit -kt
/etc/security/keytabs/hdfs.headless.keytab hdfs']" {}
2015-04-16 11:36:32,811 - Prepare to transition into safemode state OFF
2015-04-16 11:36:32,812 - call['su - hdfs -c 'hdfs dfsadmin -safemode get''] {}
2015-04-16 11:36:36,481 - Command: su - hdfs -c 'hdfs dfsadmin -safemode get'
Code: 255.
2015-04-16 11:36:36,481 - Error while executing command
'prepare_rolling_upgrade':
Traceback (most recent call last):
File
"/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
line 214, in execute
method(env)
File
"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py",
line 67, in prepare_rolling_upgrade
namenode_upgrade.prepare_rolling_upgrade()
File
"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py",
line 100, in prepare_rolling_upgrade
raise Fail("Could not transition to safemode state %s. Please check logs to
make sure namenode is up." % str(SafeMode.OFF))
Fail: Could not transition to safemode state OFF. Please check logs to make
sure namenode is up.
{code}
It looks like this happens because the Kerberos cache was not properly
initialized; I can see an old, expired ticket cache:
{code}
# su - hdfs -c 'hdfs dfsadmin -safemode get'
15/04/16 11:42:23 WARN ipc.Client: Exception encountered while connecting to
the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)]
safemode: Failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException:
No valid credentials provided (Mechanism level: Failed to find any Kerberos
tgt)]; Host Details : local host is: "<host>/<ip>"; destination host is:
"<host>":8020;
# echo $?
255
# su - hdfs
[hdfs@<host> ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1008
Default principal: hdfs@LOCALDOMAIN
Valid starting Expires Service principal
04/13/15 16:10:59 04/14/15 16:10:59 krbtgt/LOCALDOMAIN@LOCALDOMAIN
renew until 04/20/15 16:10:59
[hdfs@<host> ~]$ /usr/bin/kinit -kt /etc/security/keytabs/hdfs.headless.keytab
hdfs
[hdfs@<host> ~]$ logout
# su - hdfs -c 'hdfs dfsadmin -safemode get'
Safe mode is OFF in <nn1>/<ip1>:8020
Safe mode is OFF in <nn2>/<ip2>:8020
{code}
It looks like the Kerberos cache was initialized for root instead of the hdfs
user, since the kinit command was not wrapped in "su - hdfs".
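A minimal sketch of what the fix might look like: wrapping the kinit in "su -
hdfs -c" so the ticket lands in the hdfs user's credential cache (the one the
later "su - hdfs -c 'hdfs dfsadmin -safemode get'" call reads), rather than
root's. The helper name below is hypothetical, not Ambari's actual API:

```python
def kinit_as_user(user, keytab, principal):
    # Running kinit under "su - <user>" initializes the credential cache
    # owned by that user (e.g. /tmp/krb5cc_<uid>), so subsequent commands
    # run as that user find a valid TGT. Running plain kinit as root
    # populates root's cache instead, which is what the failed upgrade did.
    return "su - %s -c '/usr/bin/kinit -kt %s %s'" % (user, keytab, principal)

cmd = kinit_as_user("hdfs", "/etc/security/keytabs/hdfs.headless.keytab", "hdfs")
print(cmd)
```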
I retried once with the same result, to capture the error again for this JIRA,
but after I logged in as hdfs, manually kinit'd the hdfs user's Kerberos cache,
and retried in Ambari, it succeeded; so that is the workaround for now.
Hari Sekhon
http://www.linkedin.com/in/harisekhon
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)