[jira] [Updated] (AMBARI-17182) App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-12 Thread Victor Galgo (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Galgo updated AMBARI-17182:
--
Status: Patch Available  (was: Open)

> App timeline Server start fails on enabling HA because namenode is in safemode
> --
>
> Key: AMBARI-17182
> URL: https://issues.apache.org/jira/browse/AMBARI-17182
> Project: Ambari
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Victor Galgo
>Priority: Critical
>  Labels: ha, namenode
> Fix For: 2.4.0
>
> Attachments: nnha_fix.patch
>
>
> On the last step ("Start all") of enabling HA, the following error occurs:
> {code}
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 147, in <module>
>     ApplicationTimelineServer().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 43, in start
>     self.configure(env) # FOR SECURITY
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 54, in configure
>     yarn(name='apptimelineserver')
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py", line 276, in yarn
>     mode=0755
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
>     self.env.run()
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
>     self.run_action(resource, action)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
>     provider_action()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 463, in action_create_on_execute
>     self.action_delayed("create")
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 460, in action_delayed
>     self.get_hdfs_resource_executor().action_delayed(action_name, self)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 259, in action_delayed
>     self._set_mode(self.target_status)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 366, in _set_mode
>     self.util.run_command(self.main_resource.resource.target, 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 195, in run_command
>     raise Fail(err_msg)
> resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT 'http://os-s11-3-iavzl-nat-s-ru242to25susesecha-12.openstacklocal:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755'' returned status_code=403.
> {
>   "RemoteException": {
>     "exception": "RetriableException",
>     "javaClassName": "org.apache.hadoop.ipc.RetriableException",
>     "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total blocks 697.\nThe number of live datanodes 20 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached."
>   }
> }
> {code}
> This happens because the NameNode (NN) has not yet left safe mode when the
> App Timeline Server (ATS) starts, since the DataNodes (DNs) have only just started.
> To fix this, "stop namenodes" has to be triggered before "start all".
> If this is done, "Start all" ensures that the DataNodes start before the
> NameNodes, and that the NameNodes are out of safe mode before ATS starts.
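
For illustration only, the failure above can be recognized from the WebHDFS error body: the NameNode wraps the SafeModeException in a RetriableException, which signals that the call may succeed later. The sketch below is not part of nnha_fix.patch and is not Ambari code; the helper name is_safemode_error is hypothetical, and it assumes the response body is the JSON shown in the log.

```python
import json


def is_safemode_error(response_body):
    """Return True if a WebHDFS JSON error body reports a safe mode failure.

    Hypothetical helper: expects the "RemoteException" structure shown in
    the log above and treats anything else as "not a safe mode error".
    """
    try:
        remote = json.loads(response_body)["RemoteException"]
        return (remote["exception"] == "RetriableException"
                and "SafeModeException" in remote["message"])
    except (ValueError, KeyError, TypeError):
        # Not JSON, or not shaped like a WebHDFS RemoteException.
        return False
```

A caller could use such a check to decide between retrying the SETPERMISSION request and failing immediately, although the fix proposed in this issue addresses the start ordering instead.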



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AMBARI-17182) App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-12 Thread Victor Galgo (JIRA)

[ 
https://issues.apache.org/jira/browse/AMBARI-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326408#comment-15326408
 ] 

Victor Galgo commented on AMBARI-17182:
---

To test this, I installed a 3-node cluster and enabled NameNode HA on it.






[jira] [Created] (AMBARI-17182) App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-12 Thread Victor Galgo (JIRA)
Victor Galgo created AMBARI-17182:
-

 Summary: App timeline Server start fails on enabling HA because 
namenode is in safemode
 Key: AMBARI-17182
 URL: https://issues.apache.org/jira/browse/AMBARI-17182
 Project: Ambari
  Issue Type: Bug
Affects Versions: 2.4.0
Reporter: Victor Galgo
Priority: Critical
 Fix For: 2.4.0


On the last step ("Start all") of enabling HA, the following error occurs:
{code}
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 147, in <module>
    ApplicationTimelineServer().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 43, in start
    self.configure(env) # FOR SECURITY
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 54, in configure
    yarn(name='apptimelineserver')
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py", line 276, in yarn
    mode=0755
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 463, in action_create_on_execute
    self.action_delayed("create")
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 460, in action_delayed
    self.get_hdfs_resource_executor().action_delayed(action_name, self)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 259, in action_delayed
    self._set_mode(self.target_status)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 366, in _set_mode
    self.util.run_command(self.main_resource.resource.target, 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 195, in run_command
    raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w '%{http_code}' -X PUT 'http://os-s11-3-iavzl-nat-s-ru242to25susesecha-12.openstacklocal:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755'' returned status_code=403.
{
  "RemoteException": {
    "exception": "RetriableException",
    "javaClassName": "org.apache.hadoop.ipc.RetriableException",
    "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total blocks 697.\nThe number of live datanodes 20 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached."
  }
}
{code}

This happens because the NameNode (NN) has not yet left safe mode when the
App Timeline Server (ATS) starts, since the DataNodes (DNs) have only just started.

To fix this, "stop namenodes" has to be triggered before "start all".

If this is done, "Start all" ensures that the DataNodes start before the
NameNodes, and that the NameNodes are out of safe mode before ATS starts.
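
The condition the fix relies on (the NameNode has left safe mode before ATS touches HDFS) can be sketched as a polling loop. This is only an illustration under stated assumptions, not the attached patch and not Ambari's implementation: get_status is a hypothetical caller-supplied callable that returns the text printed by `hdfs dfsadmin -safemode get`, and the retry count and delay are arbitrary.

```python
import time


def safemode_is_off(status_text):
    """Return True if `hdfs dfsadmin -safemode get` style output reports OFF."""
    return "safe mode is off" in status_text.lower()


def wait_for_safemode_off(get_status, retries=30, delay_seconds=10,
                          sleep=time.sleep):
    """Poll get_status() until safe mode is reported OFF.

    get_status is a hypothetical callable returning the safe mode status
    text. Returns True once safe mode is off, False if retries run out.
    """
    for attempt in range(retries):
        if safemode_is_off(get_status()):
            return True
        if attempt < retries - 1:
            # Give the DataNodes time to send more block reports.
            sleep(delay_seconds)
    return False
```

Passing the status source and the sleep function as parameters keeps the sketch independent of any particular cluster, so it can be exercised without running HDFS.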





[jira] [Updated] (AMBARI-17182) App timeline Server start fails on enabling HA because namenode is in safemode

2016-06-12 Thread Victor Galgo (JIRA)

 [ 
https://issues.apache.org/jira/browse/AMBARI-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Galgo updated AMBARI-17182:
--
Attachment: nnha_fix.patch

I have run 'mvn clean test' for ambari-web. All tests pass:
{code}
Calling set on destroyed view
Calling set on destroyed view
Calling set on destroyed view
Calling set on destroyed view

  28668 tests complete (34 seconds)
  154 tests pending

[INFO] 
[INFO] --- apache-rat-plugin:0.11:check (default) @ ambari-web ---
[INFO] 51 implicit excludes (use -debug for more details).
[INFO] Exclude: .idea/**
[INFO] Exclude: package.json
[INFO] Exclude: public/**
[INFO] Exclude: public-static/**
[INFO] Exclude: app/assets/**
[INFO] Exclude: vendor/**
[INFO] Exclude: node_modules/**
[INFO] Exclude: node/**
[INFO] Exclude: npm-debug.log
[INFO] 1425 resources included (use -debug for more details)
Warning:  org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser: Property 
'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not 
recognized.
Compiler warnings:
  WARNING:  'org.apache.xerces.jaxp.SAXParserImpl: Property 
'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'
Warning:  org.apache.xerces.parsers.SAXParser: Feature 
'http://javax.xml.XMLConstants/feature/secure-processing' is not recognized.
Warning:  org.apache.xerces.parsers.SAXParser: Property 
'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.
Warning:  org.apache.xerces.parsers.SAXParser: Property 
'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not 
recognized.
[INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 
approved: 1425 licence.
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 1:31.015s
[INFO] Finished at: Sun Jun 12 14:37:47 EEST 2016
[INFO] Final Memory: 13M/407M
[INFO] 
{code}

[~sumitmohanty] could you please help commit this?

> App timeline Server start fails on enabling HA because namenode is in safemode
> --
>
> Key: AMBARI-17182
> URL: https://issues.apache.org/jira/browse/AMBARI-17182
> Project: Ambari
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Victor Galgo
>Priority: Critical
>  Labels: ha, namenode
> Fix For: 2.4.0
>
> Attachments: nnha_fix.patch
>
>
> On the last step ("Start all") of enabling HA, the following error occurs:
> {code}
> Traceback (most recent call last):
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 147, in <module>
>     ApplicationTimelineServer().execute()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 219, in execute
>     method(env)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 43, in start
>     self.configure(env) # FOR SECURITY
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py", line 54, in configure
>     yarn(name='apptimelineserver')
>   File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
>     return fn(*args, **kwargs)
>   File "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py", line 276, in yarn
>     mode=0755
>   File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 154, in __init__
>     self.env.run()
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
>     self.run_action(resource, action)
>   File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
>     provider_action()
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 463, in action_create_on_execute
>     self.action_delayed("create")
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 460, in action_delayed
>     self.get_hdfs_resource_executor().action_delayed(action_name, self)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 259, in action_delayed
>     self._set_mode(self.target_status)
>   File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 366, in _set_mode
> self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT',