[ 
https://issues.apache.org/jira/browse/AMBARI-17236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Hurley updated AMBARI-17236:
-------------------------------------
    Attachment: AMBARI-17236.patch

> Namenode start step failed during EU with RetriableException
> ------------------------------------------------------------
>
>                 Key: AMBARI-17236
>                 URL: https://issues.apache.org/jira/browse/AMBARI-17236
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.4.0
>
>         Attachments: AMBARI-17236.patch
>
>
> *Steps*
> # Deploy HDP-2.3.4.0 cluster with Ambari 2.2.0.0 (secure, non-HA cluster with custom service users)
> # Upgrade Ambari to 2.4.0.0-644
> # Register HDP-2.4.2.0 and install the bits
> # Start Express Upgrade
> The following error was observed while starting the NameNode:
> {code}
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 414, in <module>
>     NameNode().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 257, in execute
>     method(env)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 679, in restart
>     self.start(env, upgrade_type=upgrade_type)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
>     upgrade_suspended=params.upgrade_suspended, env=env)
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 216, in namenode
>     create_hdfs_directories()
>   File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 283, in create_hdfs_directories
>     mode=0777,
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
>     self.env.run()
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
>     self.run_action(resource, action)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
>     provider_action()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 458, in action_create_on_execute
>     self.action_delayed("create")
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 455, in action_delayed
>     self.get_hdfs_resource_executor().action_delayed(action_name, self)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 246, in action_delayed
>     self._assert_valid()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 230, in _assert_valid
>     self.target_status = self._get_file_status(target)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 291, in _get_file_status
>     list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 191, in run_command
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://os-r6-gmcdns-dlm20todgm10sec-r6-5.openstacklocal:50070/webhdfs/v1/tmp?op=GETFILESTATUS&user.name=cstm-hdfs'' returned status_code=403.
> {
>   "RemoteException": {
>     "exception": "RetriableException", 
>     "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>     "message": "NameNode still not started"
>   }
> }
> {code}
> The heart of this issue is that, depending on topology and upgrade type, we
> might not wait for the NameNode (NN) to leave Safe Mode after starting it.
> However, we always create directories, regardless of topology or upgrade type:
> {code}
>     # Always run this on non-HA, or active NameNode during HA.
>     if is_active_namenode:
>       create_hdfs_directories()
>       create_ranger_audit_hdfs_directories()
> {code}
> A NameNode in Safe Mode is read-only and would reject this anyway, even if it
> didn't throw a retriable exception:
> {code}
> [hdfs@c6403 root]$ hadoop fs -mkdir /foo
> mkdir: Cannot create directory /foo. Name node is in safe mode.
> {code}
> So it seems we need to wait for the NN to leave Safe Mode before creating
> directories, regardless of topology or upgrade type.
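
The wait proposed in the description could be sketched roughly as below. This is a minimal illustration, not the contents of the attached patch: the helper names (is_safemode_off, wait_for_safemode_off, create_directories_when_writable), the retry count, and the sleep interval are all hypothetical, and the sketch assumes Python 3 with the hdfs CLI on the PATH. It polls `hdfs dfsadmin -safemode get` until the output reports that Safe Mode is OFF, and only then runs the directory-creation step.

```python
import subprocess
import time


def is_safemode_off(run=subprocess.check_output):
    """Return True when `hdfs dfsadmin -safemode get` reports Safe Mode OFF.

    `run` is injectable so the check can be exercised without a cluster.
    """
    try:
        out = run(["hdfs", "dfsadmin", "-safemode", "get"])
    except (OSError, subprocess.CalledProcessError):
        # NameNode not reachable yet (or CLI unavailable): keep waiting.
        return False
    if isinstance(out, bytes):
        out = out.decode("utf-8", "replace")
    return "Safe mode is OFF" in out


def wait_for_safemode_off(check=is_safemode_off, tries=60, sleep_seconds=10,
                          sleep=time.sleep):
    """Poll `check` up to `tries` times; return True once Safe Mode is OFF."""
    for attempt in range(tries):
        if check():
            return True
        if attempt < tries - 1:
            sleep(sleep_seconds)
    return False


def create_directories_when_writable(create_dirs, wait=wait_for_safemode_off):
    """Run `create_dirs` only after the NameNode has left Safe Mode."""
    if not wait():
        raise RuntimeError("NameNode did not leave Safe Mode in time")
    create_dirs()
```

In this sketch the caller would pass the existing directory-creation step as create_dirs, so the wait happens on every start rather than only for some topologies or upgrade types.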



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
