[ https://issues.apache.org/jira/browse/AMBARI-17236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hurley updated AMBARI-17236:
-------------------------------------
    Description: 
*Steps*
# Deploy HDP-2.3.4.0 cluster with Ambari 2.2.0.0 (secure, non-HA cluster with custom service users)
# Upgrade Ambari to 2.4.0.0-644
# Register HDP-2.4.2.0 and install the bits
# Start Express Upgrade

The following error was observed while starting the NameNode:
{code}
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 414, in <module>
    NameNode().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 257, in execute
    method(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 679, in restart
    self.start(env, upgrade_type=upgrade_type)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 216, in namenode
    create_hdfs_directories()
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 283, in create_hdfs_directories
    mode=0777,
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 458, in action_create_on_execute
    self.action_delayed("create")
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 455, in action_delayed
    self.get_hdfs_resource_executor().action_delayed(action_name, self)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 246, in action_delayed
    self._assert_valid()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 230, in _assert_valid
    self.target_status = self._get_file_status(target)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 291, in _get_file_status
    list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 191, in run_command
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://os-r6-gmcdns-dlm20todgm10sec-r6-5.openstacklocal:50070/webhdfs/v1/tmp?op=GETFILESTATUS&user.name=cstm-hdfs'' returned status_code=403.
{
  "RemoteException": {
    "exception": "RetriableException", 
    "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
    "message": "NameNode still not started"
  }
}
{code}
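
For reference, the response body at the end of the traceback above is a JSON {{RemoteException}} envelope, so a caller could in principle tell a transient "still starting" failure apart from a genuine permission error even though both come back as 403. A minimal sketch of that check follows; {{is_retriable_webhdfs_error}} is a hypothetical helper introduced here for illustration and is not part of Ambari's {{hdfs_resource.py}}.

```python
import json

# Java class name reported by the NameNode in the traceback above.
RETRIABLE_CLASS = "org.apache.hadoop.ipc.RetriableException"


def is_retriable_webhdfs_error(body):
    """Return True if a WebHDFS error body wraps a RetriableException.

    Hypothetical helper: `body` is the raw response text returned by WebHDFS.
    Anything that is not the expected JSON envelope is treated as not retriable.
    """
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        # Not JSON at all (for example an HTML error page, or None).
        return False
    if not isinstance(payload, dict):
        return False
    remote = payload.get("RemoteException")
    if not isinstance(remote, dict):
        return False
    return remote.get("javaClassName") == RETRIABLE_CLASS
```

A caller could use the result to decide whether to retry the request after a delay or to fail immediately.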

The heart of this issue is that, depending on topology and upgrade type, we 
might not wait for the NameNode to leave Safe Mode after starting it. However, 
we always create the HDFS directories, regardless of topology or upgrade type:

{code}
    # Always run this on non-HA, or active NameNode during HA.
    if is_active_namenode:
      create_hdfs_directories()
      create_ranger_audit_hdfs_directories()
{code}

A NameNode in Safe Mode is read-only and would reject the directory creation 
anyway, even if it didn't throw a retriable exception:
{code}
[hdfs@c6403 root]$ hadoop fs -mkdir /foo
mkdir: Cannot create directory /foo. Name node is in safe mode.
{code}

So it seems we need to wait for the NameNode to leave Safe Mode regardless of 
topology or upgrade type.
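
One way to express that wait is a polling loop around {{hdfs dfsadmin -safemode get}}, which reports {{Safe mode is OFF}} once the NameNode has left Safe Mode. The sketch below only illustrates the shape of such a wait and is not the actual patch: {{wait_for_safemode_off}}, its parameters and the injectable {{run}} callable are hypothetical names, and it leaves out the kinit and run-as-HDFS-user handling that a secure cluster with custom service users would need.

```python
import subprocess
import time

# Real HDFS CLI call; prints "Safe mode is ON" or "Safe mode is OFF".
SAFEMODE_GET = ["hdfs", "dfsadmin", "-safemode", "get"]


def _run_safemode_get():
    """Run `hdfs dfsadmin -safemode get` and return its combined output as text."""
    proc = subprocess.Popen(SAFEMODE_GET, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return out.decode("utf-8", "replace")


def wait_for_safemode_off(run=_run_safemode_get, tries=60, delay=10,
                          sleep=time.sleep):
    """Poll until the NameNode reports Safe Mode is OFF.

    Hypothetical helper. Returns True as soon as the output contains
    "Safe mode is OFF", or False after `tries` unsuccessful checks.
    `run` and `sleep` are injectable so the loop can be exercised without a cluster.
    """
    for attempt in range(tries):
        if "Safe mode is OFF" in run():
            return True
        if attempt < tries - 1:
            # Only sleep when another attempt will follow.
            sleep(delay)
    return False
```

With something like this in place, {{create_hdfs_directories()}} would only be called after {{wait_for_safemode_off()}} returns True, independent of topology or upgrade type.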

  was:
*Steps*
# Deploy HDP-2.3.4.0 cluster with Ambari 2.2.0.0 (secure, non-HA cluster with custom service users)
# Upgrade Ambari to 2.4.0.0-644
# Register HDP-2.4.2.0 and install the bits
# Start Express Upgrade

Observed below error during start of NameNode:
{code}
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 414, in <module>
    NameNode().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 257, in execute
    method(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 679, in restart
    self.start(env, upgrade_type=upgrade_type)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 101, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 216, in namenode
    create_hdfs_directories()
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 283, in create_hdfs_directories
    mode=0777,
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 458, in action_create_on_execute
    self.action_delayed("create")
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 455, in action_delayed
    self.get_hdfs_resource_executor().action_delayed(action_name, self)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 246, in action_delayed
    self._assert_valid()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 230, in _assert_valid
    self.target_status = self._get_file_status(target)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 291, in _get_file_status
    list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 191, in run_command
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://os-r6-gmcdns-dlm20todgm10sec-r6-5.openstacklocal:50070/webhdfs/v1/tmp?op=GETFILESTATUS&user.name=cstm-hdfs'' returned status_code=403.
{
  "RemoteException": {
    "exception": "RetriableException", 
    "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
    "message": "NameNode still not started"
  }
}
{code}

So, the heart of this issue is that, depending on topology and upgrade type, we 
might not wait for NN to be out of Safe Mode after starting. However, we are 
always creating directories, regardless of topology/upgrade:

{code}
    # Always run this on non-HA, or active NameNode during HA.
    if is_active_namenode:
      create_hdfs_directories()
      create_ranger_audit_hdfs_directories()
{code}

NameNode, in Safe Mode, is read-only and would forbid this anyway, even if it 
didn't throw a retryable exception:
{code}
[hdfs@c6403 root]$ hadoop fs -mkdir /foo
mkdir: Cannot create directory /foo. Name node is in safe mode.
{code}

So, it seems like we need to wait for NN to be out of Safe Mode no matter what.

Looks like this was caused by:
AMBARI-16162. Reduce NN start time by removing redundant haadmin calls. 
(aonishuk)


> Namenode start step failed during EU with RetriableException
> ------------------------------------------------------------
>
>                 Key: AMBARI-17236
>                 URL: https://issues.apache.org/jira/browse/AMBARI-17236
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.4.0
>            Reporter: Jonathan Hurley
>            Assignee: Jonathan Hurley
>            Priority: Critical
>             Fix For: 2.4.0
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
