Hari Sekhon created AMBARI-10518:
------------------------------------
Summary: Ambari 2.0 stack upgrade HDP 2.2.0.0 => 2.2.4.0 breaks on
safe mode check because the hdfs user's Kerberos cache was not kinit'd properly
Key: AMBARI-10518
URL: https://issues.apache.org/jira/browse/AMBARI-10518
Project: Ambari
Issue Type: Bug
Components: ambari-server, stacks
Affects Versions: 2.0.0
Environment: HDP 2.2.0.0 => 2.2.4.0
Reporter: Hari Sekhon
After deploying the new HDP 2.2.4.0 stack to all nodes successfully in Ambari
2.0, the "perform upgrade" procedure fails on the first step:
{code}Fail: 2015-04-16 11:36:32,623 - Performing a(n) upgrade of HDFS
2015-04-16 11:36:32,624 - u"Execute['/usr/bin/kinit -kt
/etc/security/keytabs/hdfs.headless.keytab hdfs']" {}
2015-04-16 11:36:32,811 - Prepare to transition into safemode state OFF
2015-04-16 11:36:32,812 - call['su - hdfs -c 'hdfs dfsadmin -safemode get''] {}
2015-04-16 11:36:36,481 - Command: su - hdfs -c 'hdfs dfsadmin -safemode get'
Code: 255.
2015-04-16 11:36:36,481 - Error while executing command
'prepare_rolling_upgrade':
Traceback (most recent call last):
File
"/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
line 214, in execute
method(env)
File
"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py",
line 67, in prepare_rolling_upgrade
namenode_upgrade.prepare_rolling_upgrade()
File
"/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode_upgrade.py",
line 100, in prepare_rolling_upgrade
raise Fail("Could not transition to safemode state %s. Please check logs to
make sure namenode is up." % str(SafeMode.OFF))
Fail: Could not transition to safemode state OFF. Please check logs to make
sure namenode is up.
{code}
It looks like this happens because the Kerberos cache was not properly
initialized; I can see an old, expired ticket cache:
{code}
# su - hdfs -c 'hdfs dfsadmin -safemode get'
15/04/16 11:42:23 WARN ipc.Client: Exception encountered while connecting to
the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed to find
any Kerberos tgt)]
safemode: Failed on local exception: java.io.IOException:
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException:
No valid credentials provided (Mechanism level: Failed to find any Kerberos
tgt)]; Host Details : local host is: "<host>/<ip>"; destination host is:
"<host>":8020;
# echo $?
255
# su - hdfs
[hdfs@<host> ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1008
Default principal: hdfs@LOCALDOMAIN
Valid starting Expires Service principal
04/13/15 16:10:59 04/14/15 16:10:59 krbtgt/LOCALDOMAIN@LOCALDOMAIN
renew until 04/20/15 16:10:59
[hdfs@<host> ~]$ /usr/bin/kinit -kt /etc/security/keytabs/hdfs.headless.keytab
hdfs
[hdfs@<host> ~]$ logout
# su - hdfs -c 'hdfs dfsadmin -safemode get'
Safe mode is OFF in <nn1>/<ip1>:8020
Safe mode is OFF in <nn2>/<ip2>:8020
{code}
It looks like the Kerberos cache was initialized for root instead of the hdfs
user, since the kinit command was not wrapped in "su - hdfs".
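A minimal sketch of what the fix might look like: wrapping the kinit in "su -
hdfs -c" so the ticket lands in the hdfs user's credential cache (the one the
later "su - hdfs -c 'hdfs dfsadmin -safemode get'" call reads), rather than
root's. The helper name below is hypothetical, not Ambari's actual API:

```python
def kinit_as_user(user, keytab, principal):
    # Running kinit under "su - <user>" initializes the credential cache
    # owned by that user (e.g. /tmp/krb5cc_<uid>), so subsequent commands
    # run as that user find a valid TGT. Running plain kinit as root
    # populates root's cache instead, which is what the failed upgrade did.
    return "su - %s -c '/usr/bin/kinit -kt %s %s'" % (user, keytab, principal)

cmd = kinit_as_user("hdfs", "/etc/security/keytabs/hdfs.headless.keytab", "hdfs")
print(cmd)
```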
I retried once with the same result, to capture the error again for this JIRA,
but after I logged in as hdfs, manually kinit'd the hdfs user's Kerberos cache,
and retried in Ambari, it succeeded; so that is the workaround for now.
Hari Sekhon
http://www.linkedin.com/in/harisekhon
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)