> On June 17, 2016, 6:44 p.m., Di Li wrote:
> > Hello Victor,
> > 
> > So I ran some tests and observed the following. I have a 3-node cluster: 
> > c1.apache.org, c2.apache.org, and c3.apache.org.
> > 
> > 1. Right after finishing the manual steps listed on the "Initialize 
> > Metadata" step, I noticed that c1.apache.org has the NameNode process 
> > running but it is the standby, while c2.apache.org (the newly added NN) 
> > has its NN stopped.
> > 
> > 2. The state of the two NNs in #1 seems to have caused the NN's 
> > check_is_active_namenode function call to return False, thus setting 
> > ensure_safemode_off to False as well, skipping the safemode check 
> > altogether.
> > 
> > 3. If I just run the safemode check from the hadoop command line, here are 
> > the results. Notice that safemode is reported as ON on the standby node, 
> > while the other one returns a connection refused error:
> > 
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn1
> > standby
> > [hdfs@c1 ~]$ hdfs --config /usr/iop/current/hadoop-client/conf haadmin -ns 
> > binn -getServiceState nn2
> > 16/06/17 11:26:42 INFO ipc.Client: Retrying connect to server: 
> > c1.apache.org:8020. Already tried 0 time(s); retry policy is 
> > RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 
> > MILLISECONDS)
> > Operation failed: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > [hdfs@c1 ~]$ hdfs dfsadmin -safemode get
> > Safe mode is ON in c1.apache.org:8020
> > safemode: Call From c1.apache.org to c2.apache.org:8020 failed on 
> > connection exception: java.net.ConnectException: Connection refused; For 
> > more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> > 
> > So in my opinion, the fix should be at the NameNode Python script level: 
> > always check safemode against the two NNs, and make sure safemode is off 
> > on the active namenode. As a safeguard against an offline active NN, the 
> > check should eventually time out to unblock the rest of the start sequence.
> 
> Victor Galgo wrote:
>     "So in my opinion, the fix should be at the NameNode Python script level 
> to always check safemode against the two NNs". 
>     We cannot do that, because at that point all DataNodes are stopped, 
> which means the NN will never leave safemode.

Please include Jonathan Hurley in the code review since he recently modified 
the function that waits to leave safemode.
This is not the first time that we've needed a step to "leave safe mode". So 
either we put it into the Python code (and do a lot of testing on it, since it 
also impacts EU and RU), or we make a custom command for HDFS that is only 
available when HA is present and that waits for the NameNode to leave safemode.
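If the custom-command route is taken, a minimal sketch of such a wait loop 
might look like the following. All names here are hypothetical, the use of 
`-fs` to target a specific NameNode is an assumption, and Ambari's real 
resource_management helpers differ:

```python
# Hypothetical sketch of a "wait for NameNode to leave safemode" helper.
# Function and parameter names are assumptions, not Ambari's actual API.
import subprocess
import time


def safemode_is_on(namenode_address):
    """Ask one NameNode for its safemode state via the hdfs CLI.
    Treats any 'ON' in the output as safemode being active."""
    out = subprocess.check_output(
        ["hdfs", "dfsadmin", "-fs", "hdfs://" + namenode_address,
         "-safemode", "get"])
    return b"ON" in out


def wait_for_safemode_off(is_on, timeout=600, interval=10,
                          sleep=time.sleep, clock=time.time):
    """Poll is_on() until it reports safemode off or the timeout elapses.
    Returns True if safemode turned off, False on timeout, so the caller
    can unblock the rest of the start sequence either way."""
    deadline = clock() + timeout
    while clock() < deadline:
        if not is_on():
            return True
        sleep(interval)
    return False
```

The timeout-and-continue behavior matches Di Li's safeguard: an unreachable or 
stuck NameNode delays the start sequence by at most `timeout` seconds instead 
of blocking it forever.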


- Alejandro


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/48734/#review138280
-----------------------------------------------------------


On June 15, 2016, 4:41 p.m., Victor Galgo wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/48734/
> -----------------------------------------------------------
> 
> (Updated June 15, 2016, 4:41 p.m.)
> 
> 
> Review request for Ambari, Andriy Babiichuk, Alexandr Antonenko, Andrew 
> Onischuk, Di Li, Dmitro Lisnichenko, Jayush Luniya, Robert Levas, Sandor 
> Magyari, Sumit Mohanty, Sebastian Toader, and Yusaku Sako.
> 
> 
> Bugs: AMBARI-17182
>     https://issues.apache.org/jira/browse/AMBARI-17182
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> On the last step, "Start all", of enabling HA, the following happens:
> 
> Traceback (most recent call last):
>     File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 147, in <module>
>       ApplicationTimelineServer().execute()
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py",
>  line 219, in execute
>       method(env)
>     File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 43, in start
>       self.configure(env) # FOR SECURITY
>     File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/application_timeline_server.py",
>  line 54, in configure
>       yarn(name='apptimelineserver')
>     File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", 
> line 89, in thunk
>       return fn(*args, **kwargs)
>     File 
> "/var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/scripts/yarn.py",
>  line 276, in yarn
>       mode=0755
>     File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", 
> line 154, in __init__
>       self.env.run()
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 160, in run
>       self.run_action(resource, action)
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", 
> line 124, in run_action
>       provider_action()
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 463, in action_create_on_execute
>       self.action_delayed("create")
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 460, in action_delayed
>       self.get_hdfs_resource_executor().action_delayed(action_name, self)
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 259, in action_delayed
>       self._set_mode(self.target_status)
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 366, in _set_mode
>       self.util.run_command(self.main_resource.resource.target, 
> 'SETPERMISSION', method='PUT', permission=self.mode, assertable_result=False)
>     File 
> "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py",
>  line 195, in run_command
>       raise Fail(err_msg)
>   resource_management.core.exceptions.Fail: Execution of 'curl -sS -L -w 
> '%{http_code}' -X PUT 
> 'http://testvgalgo.org:50070/webhdfs/v1/ats/done?op=SETPERMISSION&user.name=hdfs&permission=755''
>  returned status_code=403. 
>   {
>     "RemoteException": {
>       "exception": "RetriableException", 
>       "javaClassName": "org.apache.hadoop.ipc.RetriableException", 
>       "message": "org.apache.hadoop.hdfs.server.namenode.SafeModeException: 
> Cannot set permission for /ats/done. Name node is in safe mode.\nThe reported 
> blocks 675 needs additional 16 blocks to reach the threshold 0.9900 of total 
> blocks 697.\nThe number of live datanodes 20 has reached the minimum number 
> 0. Safe mode will be turned off automatically once the thresholds have been 
> reached."
>     }
>   }
>   
>   
> This happens because the NN is not yet out of safemode when ATS starts, 
> since the DNs have only just started.
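As a side note, the numbers in the safemode message are internally 
consistent. Here is a rough sketch of the threshold arithmetic; it mirrors 
only the figures in the log and is not HDFS's actual implementation:

```python
import math


def blocks_needed(reported, total, threshold=0.99):
    """Rough sketch of the safemode gap seen in the log: the NameNode
    stays in safemode until strictly more than floor(threshold * total)
    blocks have been reported. Hypothetical helper, not HDFS's code."""
    required = int(math.floor(threshold * total)) + 1
    return max(0, required - reported)


# 675 reported of 697 total at threshold 0.9900 -> 16 more blocks needed,
# matching the "needs additional 16 blocks" line in the exception above.
print(blocks_needed(675, 697))
```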
> 
> To fix this, "stop namenodes" has to be triggered before "start all".
> 
> If this is done, "Start all" ensures that the DataNodes start before the 
> NameNodes, and that the NameNodes are out of safemode before ATS starts.
> 
> 
> Diffs
> -----
> 
>   
> ambari-web/app/controllers/main/admin/highAvailability/nameNode/step9_controller.js
>  24677e4 
>   ambari-web/app/messages.js 6465812 
> 
> Diff: https://reviews.apache.org/r/48734/diff/
> 
> 
> Testing
> -------
> 
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> Calling set on destroyed view
> 
>   28668 tests complete (34 seconds)
>   154 tests pending
> 
> [INFO] 
> [INFO] --- apache-rat-plugin:0.11:check (default) @ ambari-web ---
> [INFO] 51 implicit excludes (use -debug for more details).
> [INFO] Exclude: .idea/**
> [INFO] Exclude: package.json
> [INFO] Exclude: public/**
> [INFO] Exclude: public-static/**
> [INFO] Exclude: app/assets/**
> [INFO] Exclude: vendor/**
> [INFO] Exclude: node_modules/**
> [INFO] Exclude: node/**
> [INFO] Exclude: npm-debug.log
> [INFO] 1425 resources included (use -debug for more details)
> Warning:  org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser: Property 
> 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not 
> recognized.
> Compiler warnings:
>   WARNING:  'org.apache.xerces.jaxp.SAXParserImpl: Property 
> 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.'
> Warning:  org.apache.xerces.parsers.SAXParser: Feature 
> 'http://javax.xml.XMLConstants/feature/secure-processing' is not recognized.
> Warning:  org.apache.xerces.parsers.SAXParser: Property 
> 'http://javax.xml.XMLConstants/property/accessExternalDTD' is not recognized.
> Warning:  org.apache.xerces.parsers.SAXParser: Property 
> 'http://www.oracle.com/xml/jaxp/properties/entityExpansionLimit' is not 
> recognized.
> [INFO] Rat check: Summary of files. Unapproved: 0 unknown: 0 generated: 0 
> approved: 1425 licence.
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] BUILD SUCCESS
> [INFO] 
> ------------------------------------------------------------------------
> [INFO] Total time: 1:31.015s
> [INFO] Finished at: Sun Jun 12 14:37:47 EEST 2016
> [INFO] Final Memory: 13M/407M
> [INFO] 
> ------------------------------------------------------------------------
> 
> Also, to test this I installed a 3-node cluster and enabled NameNode HA on 
> it.
> 
> 
> Thanks,
> 
> Victor Galgo
> 
>
