> On Dec. 28, 2015, 4:58 p.m., Apache Ambari wrote:
> > ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py, line 45
> > <https://reviews.apache.org/r/41691/diff/1/?file=1175382#file1175382line45>
> >
> >     I spoke to Aravindan about this.
> >     Consider what happens when the server timeout value is:
> >
> >     A. < 30 mins (default of 20): If NN takes more than 30 mins to come out
> >     of safemode, then the task will be aborted and the user will have to
> >     retry the step (e.g., restart NameNode and wait again).
> >
> >     B. 30 or higher: NN will wait up to 30 mins. If it is still in safemode
> >     after 30 mins, the task will proceed.
> >
> >     For a very large cluster, this can take much longer than 30 mins and
> >     we'll be in the same boat again.
> >     There are 2 other potential solutions:
> >     1. Have a timeout value in ambari.properties that is specific to
> >     waiting to leave safemode.
> >     2. Pass in the value of the server timeout to the command. So if the
> >     user bumps it up to 40 mins, then NameNode can always wait up to x-5 mins.
> >
> >     What do you think?
>
> Sumit Mohanty wrote:
>     The problem here is that any limit we can configure could be smaller than
>     the time taken to come out of safemode. So we can define a new property to
>     capture the NN timeout, but it will still be guesswork as to what the value
>     should be. The long-term solution seems to be a feature where the user can
>     tell Ambari to abort or continue to wait for NN to come out of safemode.
>     Is it something that the EU does today? (EU will allow users to retry, will
>     it?)
>
>     This specific JIRA is tracking the problem of the default timeout being
>     out of sync with the default retry duration. So we should fix that and open
>     a new Task to discuss how to track getting out of safemode gracefully.
>
> Apache Ambari wrote:
>     Yes, both solutions I proposed would handle this. #2 is easiest to do.
>     #1 would need any NameNode restart operation to change the default timeout
>     value of the task.
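
To make option #2 above concrete, here is a rough sketch of how the safemode wait could be derived from the server task timeout instead of a fixed 30 mins. This is only an illustration, not what the patch does: the names safemode_wait_budget, wait_for_safemode_off and is_safemode_off are hypothetical, and the 5 minute margin is simply the "x-5 mins" figure from the comment. It is written for Python 3 (time.monotonic).

```python
import time

# Margin left for the rest of the task; taken from the "x-5 mins" suggestion.
SAFETY_MARGIN_SECS = 5 * 60


def safemode_wait_budget(server_timeout_secs, margin_secs=SAFETY_MARGIN_SECS):
    """Seconds the script may spend waiting for safemode, never negative."""
    return max(0, server_timeout_secs - margin_secs)


def wait_for_safemode_off(is_safemode_off, server_timeout_secs, poll_secs=10,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll until safemode is off or the budget is used up.

    is_safemode_off is a hypothetical callable returning True once the
    NameNode has left safemode. Returns True on success, False on timeout.
    """
    deadline = clock() + safemode_wait_budget(server_timeout_secs)
    while True:
        if is_safemode_off():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)
```

With a server timeout of 40 mins this waits at most 35 mins, and with the default of 20 mins at most 15, so the script always gives up before the server kills it. It does not address the concern that any configured limit can still be too small for a very large cluster.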

Agreed that more advanced mechanisms for leaving safemode need to be discussed in a separate task. The change itself is not large, but there are a lot of options for how we can handle this and how it can be configured.


- Eugene


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41691/#review112008
-----------------------------------------------------------


On Dec. 23, 2015, 5:20 p.m., Dmitro Lisnichenko wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41691/
> -----------------------------------------------------------
>
> (Updated Dec. 23, 2015, 5:20 p.m.)
>
>
> Review request for Ambari, Alejandro Fernandez, Eugene Chekanskiy, Sumit Mohanty, and Vitalyi Brodetskyi.
>
>
> Bugs: AMBARI-14479
>     https://issues.apache.org/jira/browse/AMBARI-14479
>
>
> Repository: ambari
>
>
> Description
> -------
>
> Issue
> The NameNode safemode check timeout of 30 mins is longer than the server
> timeout of 20 mins for a task. Hence, the server kills the NameNode startup
> script if it takes more than 20 mins to get out of safemode.
>
>
> Diffs
> -----
>
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 1766c44
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py 67db735
>   ambari-server/src/test/python/stacks/2.0.6/HDFS/test_namenode.py 399fd8d
>
> Diff: https://reviews.apache.org/r/41691/diff/
>
>
> Testing
> -------
>
> mvn clean test
>
>
> Thanks,
>
> Dmitro Lisnichenko
>
>
