[ https://issues.apache.org/jira/browse/AMBARI-17236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Hurley updated AMBARI-17236:
-------------------------------------
    Attachment: AMBARI-17236.patch

> Namenode start step failed during EU with RetriableException
> ------------------------------------------------------------
>
>                 Key: AMBARI-17236
>                 URL: https://issues.apache.org/jira/browse/AMBARI-17236
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-17236.patch
>
>
> *Steps*
> # Deploy HDP-2.3.4.0 cluster with Ambari 2.2.0.0 (secure, non-HA cluster with custom service users)
> # Upgrade Ambari to 2.4.0.0-644
> # Register HDP-2.4.2.0 and install the bits
> # Start Express Upgrade
> Observed below error during start of NameNode:
> {code}
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 414, in <module>
>     NameNode().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 257, in execute
>     method(env)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 679, in restart
>     self.start(env, upgrade_type=upgrade_type)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
>     upgrade_suspended=params.upgrade_suspended, env=env)
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 216, in namenode
>     create_hdfs_directories()
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 283, in create_hdfs_directories
>     mode=0777,
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
>     self.env.run()
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
>     self.run_action(resource, action)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
>     provider_action()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 458, in action_create_on_execute
>     self.action_delayed("create")
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 455, in action_delayed
>     self.get_hdfs_resource_executor().action_delayed(action_name, self)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 246, in action_delayed
>     self._assert_valid()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 230, in _assert_valid
>     self.target_status = self._get_file_status(target)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 291, in _get_file_status
>     list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 191, in run_command
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://os-r6-gmcdns-dlm20todgm10sec-r6-5.openstacklocal:50070/webhdfs/v1/tmp?op=GETFILESTATUS&user.name=cstm-hdfs'' returned status_code=403.
> {
>   "RemoteException": {
>     "exception": "RetriableException",
>     "javaClassName": "org.apache.hadoop.ipc.RetriableException",
>     "message": "NameNode still not started"
>   }
> }
> {code}
> The heart of this issue is that, depending on topology and upgrade type, we might not wait for the NameNode to be out of Safe Mode after starting it. However, we always create directories, regardless of topology/upgrade:
> {code}
> # Always run this on non-HA, or active NameNode during HA.
> if is_active_namenode:
>   create_hdfs_directories()
>   create_ranger_audit_hdfs_directories()
> {code}
> A NameNode in Safe Mode is read-only and would forbid the directory creation anyway, even if it didn't throw a retryable exception:
> {code}
> [hdfs@c6403 root]$ hadoop fs -mkdir /foo
> mkdir: Cannot create directory /foo. Name node is in safe mode.
> {code}
> So it seems we need to wait for the NameNode to be out of Safe Mode no matter what.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
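[Editor's sketch] The "wait for the NameNode to be out of Safe Mode no matter what" conclusion above amounts to a bounded polling loop. This is an illustrative sketch, not the attached AMBARI-17236.patch: `wait_for_safemode_off` and its parameters are hypothetical names. The status string it checks matches the output of the real `hdfs dfsadmin -safemode get` command ("Safe mode is ON"/"Safe mode is OFF"); the status callable and the `sleep` function are injected so the loop can be exercised without a live cluster.

```python
import time


def wait_for_safemode_off(get_safemode_status, timeout=1800, poll_interval=10,
                          sleep=time.sleep):
    """Block until the NameNode reports Safe Mode is OFF; fail after `timeout` seconds.

    get_safemode_status: zero-argument callable returning the status line,
    e.g. the stdout of `hdfs dfsadmin -safemode get` ("Safe mode is OFF").
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if "OFF" in get_safemode_status():
            return True  # only now is it safe to create HDFS directories
        sleep(poll_interval)  # injectable so tests need not actually wait
    raise RuntimeError(
        "NameNode did not leave Safe Mode within %s seconds" % timeout)


# Simulated run: the NameNode leaves Safe Mode on the third poll.
responses = iter(["Safe mode is ON", "Safe mode is ON", "Safe mode is OFF"])
wait_for_safemode_off(lambda: next(responses), timeout=60, poll_interval=1,
                      sleep=lambda s: None)
```

In a real Ambari-style script the status callable would shell out to `hdfs dfsadmin -safemode get` (or an equivalent JMX query), and the call would sit between starting the NameNode and `create_hdfs_directories()`, unconditionally rather than only on some topology/upgrade paths.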