[ 
https://issues.apache.org/jira/browse/AMBARI-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Galgo updated AMBARI-17182:
----------------------------------
    Status: Patch Available  (was: Open)

> App timeline Server start fails on enabling HA because namenode is in safemode
> ------------------------------------------------------------------------------
>
>                 Key: AMBARI-17182
>                 URL: https://issues.apache.org/jira/browse/AMBARI-17182
>             Project: Ambari
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Victor Galgo
>            Priority: Critical
>              Labels: ha, namenode
>             Fix For: 2.4.0
>
>         Attachments: nnha_fix.patch
>
>
> On the last step ("Start all") of enabling HA, the following failure occurs:
> {code}
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 147, in <module>
>     ApplicationTimelineServer().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 43, in start
>     self.configure(env) # FOR SECURITY
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 54, in configure
>     yarn(name='apptimelineserver')
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py", line 276, in yarn
>     mode=0755
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
>     self.env.run()
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
>     self.run_action(resource, action)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
>     provider_action()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 463, in action_create_on_execute
>     self.action_delayed("create")
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 460, in action_delayed
>     self.get_hdfs_resource_executor().action_delayed(action_name, self)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 259, in action_delayed
>     self._set_mode(self.target_status)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 366, in _set_mode
>     self.util.run_command(self.main_resource.resource.target, 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 195, in run_command
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT 'http://os-s11-3-iavzl-nat-s-ru242to25susesecha-12.openstacklocal:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755'' returned status_code=403.
> {
>   "RemoteException": {
>     "exception": "RetriableException",
>     "javaClassName": "org.apache.hadoop.ipc.RetriableException",
>     "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total blocks 697.\nThe number of live datanodes 20 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached."
>   }
> }
> {code}
> This happens because the NameNode has not yet left safe mode when ATS
> starts, since the DataNodes have only just started.
> To fix this, "stop namenodes" has to be triggered before "Start all".
> With that in place, "Start all" ensures that the DataNodes start before the
> NameNodes, and that the NameNodes are out of safe mode before ATS starts.
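
For illustration only, the "NameNodes are out of safe mode before ATS starts" condition could be checked with a small polling loop like the one below. This is a standalone sketch, not the content of nnha_fix.patch and not Ambari's resource_management code. The function names (safemode_is_off, query_safemode, wait_for_safemode_off) and the retry defaults are hypothetical. It assumes that `hdfs dfsadmin -safemode get` prints one "Safe mode is ON/OFF" line per NameNode.

```python
import subprocess
import time


def safemode_is_off(output):
    """Return True only if every 'Safe mode is ...' line reports OFF."""
    lines = [line for line in output.splitlines() if "Safe mode is" in line]
    # No status line at all is treated as "not confirmed OFF".
    return bool(lines) and all("Safe mode is OFF" in line for line in lines)


def query_safemode():
    """Run the HDFS CLI; assumes an `hdfs` client is on PATH."""
    out = subprocess.check_output(["hdfs", "dfsadmin", "-safemode", "get"])
    return out.decode("utf-8", "replace")


def wait_for_safemode_off(query=query_safemode, retries=30, delay=10,
                          sleep=time.sleep):
    """Poll until safe mode is OFF; return False if retries run out."""
    for attempt in range(retries):
        if safemode_is_off(query()):
            return True
        if attempt < retries - 1:
            sleep(delay)  # give DataNodes time to send block reports
    return False
```

A caller would start ATS only once wait_for_safemode_off() returns True. The stock `hdfs dfsadmin -safemode wait` command blocks until safe mode is off and may be simpler where a bounded retry count is not required.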



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
