-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51337/#review146535
-----------------------------------------------------------

ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py (lines 504 - 535)
<https://reviews.apache.org/r/51337/#comment213042>

It's very hard to see the changes in Reviewboard, so I'll post the whole function here. Essentially, we no longer short-circuit if we detect the _other_ NN as Active. We _must_ wait for this NameNode to register as _something_.

```
@retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail)
def is_this_namenode_active(hdfs_binary):
  """
  Gets whether the current NameNode is Active. This function will wait until the NameNode is
  listed as being either Active or Standby before returning a value. This ensures that, if the
  other NameNode is Active, this NameNode has fully loaded and registered in the event that the
  other NameNode is going to be restarted. This prevents a situation where we detect the other
  NameNode as Active before this NameNode has fully booted. If the other Active NameNode is then
  restarted, there can be a loss of service if this NameNode has not entered Standby.
  """
  import params

  # returns [active_namenodes, standby_namenodes, unknown_namenodes]
  namenode_states = namenode_ha_utils.get_namenode_states(params.hdfs_site, params.security_enabled,
    params.hdfs_user, times=5, sleep_time=5, backoff_factor=2)

  active_namenodes = namenode_states[0]
  standby_namenodes = namenode_states[1]

  # check to see if this is the active NameNode
  if params.namenode_id in active_namenodes:
    return True

  # if this is not the active NameNode, then we must wait for it to register as standby
  if params.namenode_id in standby_namenodes:
    return False

  # at this point, this NameNode is neither active nor standby - we must wait to ensure it
  # enters at least one of these roles before returning a verdict - the decorator will catch
  # this failure and retry the function automatically
  raise Fail("The NameNode {namenode_id} is not listed as Active or Standby, waiting...".format(
    namenode_id=params.namenode_id))
```


- Jonathan Hurley


On Aug. 23, 2016, 12:09 p.m., Jonathan Hurley wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51337/
> -----------------------------------------------------------
>
> (Updated Aug. 23, 2016, 12:09 p.m.)
>
>
> Review request for Ambari, Alejandro Fernandez, Dmytro Grinenko, Jayush Luniya, and Nate Cole.
>
>
> Bugs: AMBARI-18240
>     https://issues.apache.org/jira/browse/AMBARI-18240
>
>
> Repository: ambari
>
>
> Description
> -------
>
> This is caused by an outage of both NameNodes during the downgrade.
>
> - We have two NNs at the "Finalize Upgrade" state;
> -- nn1 is standby (out of safemode)
> -- nn2 is active (out of safemode)
> - A downgrade begins and we restart nn1
> -- After the restart of nn1, it hasn't come online yet. Our code tries to contact it and can't, so we move on to nn2.
> -- nn2 is online and active and out of safemode (because it hasn't been downgraded yet), so we let the downgrade continue
> - The downgrade continues and we restart nn2
> -- However, nn1 is still coming online and isn't even standby yet
>
> Now we have an nn1 which isn't fully loaded and an nn2 which is restarting and trying to figure out whether to be active or standby. It's during this gap that the tests must be failing.
>
> So, it seems like we need to be a little bit smarter about waiting for the namenode to restart; we can't just look at the "active" one and say things are OK because it might be the next one to restart.
>
>
> Diffs
> -----
>
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 5a431aa
>
> Diff: https://reviews.apache.org/r/51337/diff/
>
>
> Testing
> -------
>
> PENDING
>
>
> Thanks,
>
> Jonathan Hurley
>
>
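
For anyone who wants to try the wait-for-role behaviour outside of Ambari, here is a minimal, self-contained sketch of the scenario from the description (nn1 just restarted, nn2 already active). It is not the Ambari code: `Fail`, `retry`, `make_role_check`, and `poll_states` are simplified, hypothetical stand-ins written for this illustration, and the real decorator's sleep and backoff behaviour may differ. Sleeps are recorded instead of performed so the example runs instantly.

```python
import time


class Fail(Exception):
    """Stand-in for the Fail exception named in the review; not Ambari's class."""


def retry(times, sleep_time, backoff_factor, err_class, sleep=time.sleep):
    """Simplified stand-in for a retry decorator: call the function up to
    `times` times, sleeping between attempts and multiplying the delay by
    `backoff_factor` after each failure."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            delay = sleep_time
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except err_class:
                    if attempt == times:
                        raise  # out of attempts: surface the failure
                    sleep(delay)
                    delay *= backoff_factor
        return wrapper
    return decorator


def make_role_check(namenode_id, poll_states, sleep=time.sleep):
    """Build a check that returns True (active) or False (standby) only once
    `namenode_id` appears in one of those lists; otherwise it raises Fail so
    the decorator retries."""
    @retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail, sleep=sleep)
    def is_this_namenode_active():
        active, standby, _unknown = poll_states()
        if namenode_id in active:
            return True
        if namenode_id in standby:
            return False
        # Seeing the *other* NameNode as active is not enough to return.
        raise Fail("NameNode {0} is not listed as Active or Standby, waiting...".format(namenode_id))
    return is_this_namenode_active


# nn1 is reported as unknown for two polls before it registers as standby.
polls = iter([
    (["nn2"], [], ["nn1"]),
    (["nn2"], [], ["nn1"]),
    (["nn2"], ["nn1"], []),
])
delays = []  # record the requested sleeps instead of actually waiting
check = make_role_check("nn1", lambda: next(polls), sleep=delays.append)
result = check()
print(result, delays)  # False [5, 10]
```

The old short-circuit would have returned on the first poll because nn2 was already active. In this sketch the check only returns after nn1 itself shows up as standby, which is the point at which restarting nn2 becomes safe.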
