Re: Review Request 41691: Namenode start fails when time taken to get out of safemode is more than 20 minutes. Additional patch

Sumit Mohanty Mon, 28 Dec 2015 09:11:44 -0800


> On Dec. 28, 2015, 4:58 p.m., Apache Ambari wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py, line 45
> > <https://reviews.apache.org/r/41691/diff/1/?file=1175382#file1175382line45>
> >
> >     I spoke to Aravindan about this.
> >     Consider what happens when the server timeout value is:
> >     
> >     A. < 30 mins (default of 20): If NN takes longer than the server 
> > timeout to come out of safemode, then the task will be aborted and the user 
> > will have to retry the step (e.g., restart NameNode and wait again).
> >     
> >     B. 30 or higher: Then NN will wait up to 30 mins. If it is still in 
> > safemode after 30 mins, then the task will proceed.
> >     
> >     For a very large cluster, this can take much longer than 30 mins and 
> > we'll be in the same boat again.
> >     There are 2 other potential solutions:
> >     1. Have a timeout value in ambari.properties that is specific for 
> > waiting to leave safemode
> >     2. Pass in the value of the server timeout to the command. So if the 
> > user bumps it up to x mins (e.g., 40), then NameNode can always wait up to 
> > x-5 mins.
> >     
> >     What do you think?

The problem here is that any limit we configure could be smaller than the 
time taken to come out of safemode. So we can define a new property to capture 
the NN timeout, but it will still be guesswork as to what the value should be. 
The long-term solution seems to be a feature where the user can tell Ambari to 
abort or continue to wait for NN to come out of safemode. Is that something 
the EU does today? (EU will allow users to retry, won't it?)

This specific JIRA is tracking the problem of the default timeout being out of 
sync with the default retry duration. So we should fix that and open a new task 
to discuss how to handle waiting for NN to leave safemode gracefully.
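
For illustration only, option 2 from the comment above (deriving the safemode 
wait from the server timeout) might look like the minimal sketch below. The 
function name safemode_retry_plan, the 5-minute margin, and the 10-second poll 
interval are assumptions made for this example; they are not taken from the 
patch or from Ambari's API.

```python
# Hypothetical helper: names and defaults are illustrative, not Ambari's API.
def safemode_retry_plan(server_timeout_secs, margin_secs=300, sleep_secs=10):
    """Return (tries, try_sleep) whose total wait stays below the server timeout.

    The wait budget is the server task timeout minus a safety margin, so the
    safemode check gives up before the server kills the task.
    """
    budget = server_timeout_secs - margin_secs
    if budget <= 0:
        raise ValueError("server timeout must exceed the safety margin")
    # Integer division keeps tries * sleep_secs within the budget.
    tries = max(1, budget // sleep_secs)
    return tries, sleep_secs
```

With the default 20-minute task timeout this yields a 15-minute wait, and a 
40-minute timeout yields 35 minutes. It keeps the two values in sync, but it 
does not help when NN needs longer than whatever timeout was chosen.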

- Sumit

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41691/#review112008
-----------------------------------------------------------

On Dec. 23, 2015, 5:20 p.m., Dmitro Lisnichenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41691/
> -----------------------------------------------------------
> 
> (Updated Dec. 23, 2015, 5:20 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Eugene Chekanskiy, Sumit 
> Mohanty, and Vitalyi Brodetskyi.
> 
> 
> Bugs: AMBARI-14479
>     https://issues.apache.org/jira/browse/AMBARI-14479
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Issue
> The Namenode safemode check timeout of 30 mins exceeds the server timeout 
> of 20 mins for a task. Hence, the server kills the Namenode startup script 
> if it takes more than 20 mins to get out of safemode.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 1766c44 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py 67db735 
>   ambari-server/src/test/python/stacks/2.0.6/HDFS/test_namenode.py 399fd8d 
> 
> Diff: https://reviews.apache.org/r/41691/diff/
> 
> 
> Testing
> -------
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Dmitro Lisnichenko
> 
>
