Re: Review Request 41691: Namenode start fails when time taken to get out of safemode is more than 20 minutes. Additional patch

Eugene Chekanskiy Mon, 28 Dec 2015 11:03:44 -0800


> On Dec. 28, 2015, 4:58 p.m., Apache Ambari wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py,
> >  line 45
> > <https://reviews.apache.org/r/41691/diff/1/?file=1175382#file1175382line45>
> >
> >     I spoke to Aravindan about this.
> >     Consider what happens when the server timeout value is:
> >     
> >     A. < 30 mins (default of 20): If NN takes longer than the server 
> > timeout to come out of safemode, then the task will be aborted and the 
> > user will have to retry the step (e.g., restart NameNode and wait again).
> >     
> >     B. 30 or higher: Then NN will wait up to 30 mins. If it is still in 
> > safemode after 30 mins, then the task will proceed.
> >     
> >     For a very large cluster, this can take much longer than 30 mins and 
> > we'll be in the same boat again.
> >     There are 2 other potential solutions:
> >     1. Have a timeout value in ambari.properties that is specific for 
> > waiting to leave safemode
> >     2. Pass in the value of the server timeout to the command. So if the 
> > user bumps it up to 40 mins, then NameNode can always wait up to x-5 mins.
> >     
> >     What do you think?
> 
> Sumit Mohanty wrote:
>     The problem here is that any limit we configure could be smaller than 
> the time taken to come out of safemode. So we can define a new property to 
> capture the NN timeout, but it will still be guesswork as to what the value 
> should be. The long-term solution seems to be a feature where the user can 
> tell Ambari to abort or continue to wait for NN to come out of safemode. 
> Is that something the EU does today? (The EU will allow users to retry, 
> won't it?)
>     
>     This specific JIRA is tracking the problem of the default timeout being 
> out of sync with the default retry duration. So we should fix that and open a 
> new Task to discuss the solution for how to track getting out of the safemode 
> gracefully.
> 
> Apache Ambari wrote:
>     Yes, both solutions I proposed would handle this. #2 is the easiest to 
> do. #1 would need any NameNode restart operation to change the default 
> timeout value of the task.


Agreed that more advanced safemode-leaving mechanisms need to be discussed 
in a separate task. It is not a large change, but there are a lot of options 
for how we can handle this and how it can be configured.
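
As a rough illustration of option 2 from the quoted discussion (pass the 
server timeout to the command and let NameNode wait up to x-5 mins), here is 
a minimal sketch. It is not the patch under review. The names 
safemode_wait_budget, wait_for_safemode_off, margin_secs and 
poll_interval_secs are hypothetical, and the safemode check is passed in as a 
callable instead of invoking any real HDFS command.

```python
import time


def safemode_wait_budget(server_timeout_secs, margin_secs=300, poll_interval_secs=10):
    """Derive (tries, try_sleep) from the server task timeout.

    Assumption: the wait should end margin_secs before the server would
    kill the task (the "x-5 mins" idea from the review comment).
    """
    # Never go below a single poll interval, even for tiny timeouts.
    budget = max(server_timeout_secs - margin_secs, poll_interval_secs)
    tries = budget // poll_interval_secs
    return tries, poll_interval_secs


def wait_for_safemode_off(is_in_safemode, tries, try_sleep, sleep=time.sleep):
    """Poll is_in_safemode() up to `tries` times.

    Returns True as soon as safemode is off, False if the budget runs out.
    `is_in_safemode` is a hypothetical callable supplied by the caller.
    """
    for attempt in range(tries):
        if not is_in_safemode():
            return True
        # No need to sleep after the final attempt.
        if attempt < tries - 1:
            sleep(try_sleep)
    return False


if __name__ == "__main__":
    # With a 40-minute server timeout the wait budget is 35 minutes.
    tries, try_sleep = safemode_wait_budget(40 * 60)
    print(tries, try_sleep)
```

With the default 20-minute server timeout this gives a 15-minute wait, so the 
script would stop polling before the server aborts the task. It does not 
address the open question of what to do when the cluster needs longer than 
any configured limit.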


- Eugene


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41691/#review112008
-----------------------------------------------------------


On Dec. 23, 2015, 5:20 p.m., Dmitro Lisnichenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41691/
> -----------------------------------------------------------
> 
> (Updated Dec. 23, 2015, 5:20 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Eugene Chekanskiy, Sumit 
> Mohanty, and Vitalyi Brodetskyi.
> 
> 
> Bugs: AMBARI-14479
>     https://issues.apache.org/jira/browse/AMBARI-14479
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> Issue
> The NameNode safemode check timeout of 30 mins is longer than the server 
> timeout of 20 mins for a task. Hence, the server kills the NameNode startup 
> script if it takes more than 20 mins to get out of safemode.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 1766c44 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py 67db735 
>   ambari-server/src/test/python/stacks/2.0.6/HDFS/test_namenode.py 399fd8d 
> 
> Diff: https://reviews.apache.org/r/41691/diff/
> 
> 
> Testing
> -------
> 
> mvn clean test
> 
> 
> Thanks,
> 
> Dmitro Lisnichenko
> 
>
