> On June 17, 2016, 5:44 p.m., Alejandro Fernandez wrote:
> > ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js,
> >  line 25
> > <https://reviews.apache.org/r/48734/diff/1/?file=1420112#file1420112line25>
> >
> >     How does this fix the issue? If NN just started, it still needs to get 
> > block reports, so ATS can still fail.
> 
> Victor Galgo wrote:
>     Alejandro thanks for having a look!
>     
>     This fixes the issue because when we do "Start All" later on, NameNode 
> start is triggered before ATS start (role_command_order), and the NameNode 
> start waits until safemode is off before proceeding with ATS and the others.
> 
> Alejandro Fernandez wrote:
>     I still prefer a custom command to wait for NN to leave safemode. A 
> restart command has a timeout, so if it doesn't finish in time, a retry will 
> stop NN again, which we don't want.
>     Further, a start command also has a timeout, and another start command 
> may do a no-op since it's already started.
>     I think the logic is far cleaner and more maintainable with a custom 
> command just for leaving safemode.

I definitely agree with you. Our "Start All" is fragile as it stands.
But the solution you proposed cannot be implemented: we cannot wait for the 
NameNodes to leave safemode before "Start All", because at that point all 
DataNodes are stopped, which means the NameNodes will never come out of safemode.
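
For illustration, here is a rough sketch of what such a wait-for-safemode 
command would have to do (a hypothetical helper shelling out to the hdfs CLI, 
not code from this patch). With every DataNode stopped it can only spin until 
its timeout, which is exactly the problem:

    # Hypothetical helper; assumes the hdfs CLI is on PATH and runs as a user
    # allowed to query dfsadmin.
    import subprocess
    import time

    def wait_for_safemode_off(timeout=600, interval=10):
        deadline = time.time() + timeout
        while time.time() < deadline:
            proc = subprocess.Popen(["hdfs", "dfsadmin", "-safemode", "get"],
                                    stdout=subprocess.PIPE)
            out, _ = proc.communicate()
            # "Safe mode is OFF" is only reported once enough DataNode block
            # reports have come in, so with all DNs down this never matches.
            if "OFF" in out.decode():
                return True
            time.sleep(interval)
        return False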


- Victor


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138263
-----------------------------------------------------------


On June 17, 2016, 10:45 p.m., Victor Galgo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> -----------------------------------------------------------
> 
> (Updated June 17, 2016, 10:45 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jonathan Hurley, Jayush Luniya, Robert 
> Levas, Sandor Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
>     https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> On the last step of enabling HA, "Start all", the following happens:
> 
> Traceback (most recent call last):
>     File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in <module>
>       ApplicationTimelineServer().execute()
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>       method(env)
>     File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>       self.configure(env) # FOR SECURITY
>     File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>       yarn(name='apptimelineserver')
>     File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>       return fn(*args, **kwargs)
>     File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>       mode=0755
>     File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>       self.env.run()
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>       self.run_action(resource, action)
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>       provider_action()
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>       self.action_delayed("create")
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>       self.get_hdfs_resource_executor().action_delayed(action_name, self)
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>       self._set_mode(self.target_status)
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>       self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>       raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755''
>  returned status_code=403. 
>   {
>     "RemoteException": {
>       "exception": "RetriableException", 
>       "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>       "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
>     }
>   }
>   
>   
> This happens because the NameNode is not yet out of safemode at the moment 
> ATS starts, since the DataNodes have only just started.
> 
> To fix this, "stop namenodes" has to be triggered before "start all".
> 
> If this is done, "Start all" ensures that the DataNodes start prior to the 
> NameNodes, and that the NameNodes are out of safemode before ATS starts.
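> 
> For reference, the ordering that "Start all" relies on is captured in the 
> stack's role_command_order; conceptually (an illustrative sketch, not the 
> actual role_command_order.json contents) it amounts to:
> 
>     # Conceptual dependency map (illustrative only): a role's command is
>     # scheduled after the commands it depends on have completed.
>     role_command_order = {
>         "APP_TIMELINE_SERVER-START": ["NAMENODE-START", "DATANODE-START"],
>     }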
> 
> 
> Diffs
> -----
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: https://reviews.apache.org/r/48734/diff/
> 
> 
> Testing
> -------
> 
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> 
>   28668 tests complete (34 seconds)
>   154 tests pending
> 
> [INFO] 
> [INFO] --- apache-rat-plugin:0.11:check (default) @ ambari-web ---
> [INFO] 51 implicit excludes (use -debug for more details).
> [INFO] Exclude: .idea/**
> [INFO] Exclude: package.json
> [INFO] Exclude: public/**
> [INFO] Exclude: public-static/**
> [INFO] Exclude: app/assets/**
> [INFO] Exclude: vendor/**
> [INFO] Exclude: node_modules/**
> [INFO] Exclude: node/**
> [INFO] Exclude: npm-debug.log
> [INFO] 1425 resources included (use -debug for more details)
> Warning:  org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser: Property 
> 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not 
> recognized.
> Compiler warnings:
>   WARNING:  'org.apache.xerces.jaxp.SAXParserImpl: Property 
> 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'
> Warning:  org.apache.xerces.parsers.SAXParser: Feature 
> 'http://javax.xml.XMLConstants/feature/secure-processing' is not recognized.
> Warning:  org.apache.xerces.parsers.SAXParser: Property 
> 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.
> Warning:  org.apache.xerces.parsers.SAXParser: Property 
> 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not 
> recognized.
> [INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 
> approved: 1425 licence.
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] Total time: 1:31.015s
> [INFO] Finished at: Sun Jun 12 14:37:47 EEST 2016
> [INFO] Final Memory: 13M/407M
> [INFO] 
> ------------------------------------------------------------------------
> 
> Also, to test this I installed a 3-node cluster and enabled NameNode HA on it.
> 
> 
> Thanks,
> 
> Victor Galgo
> 
>
