-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51337/#review146535
-----------------------------------------------------------




ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py
 (lines 504 - 535)
<https://reviews.apache.org/r/51337/#comment213042>

    It's very hard to see the changes in Review Board, so I'll post the whole 
function here. Essentially, we no longer short-circuit when we detect the _other_ 
NN as Active. We _must_ wait for this NameNode to register as _something_ 
(either Active or Standby).
    
    ```
    @retry(times=5, sleep_time=5, backoff_factor=2, err_class=Fail)
    def is_this_namenode_active(hdfs_binary):
      """
      Gets whether the current NameNode is Active. This function will wait until the NameNode is
      listed as being either Active or Standby before returning a value. This ensures that, if
      the other NameNode is Active, this NameNode has fully loaded and registered in the event
      that the other NameNode is going to be restarted. This prevents a situation where we detect
      the other NameNode as Active before this NameNode has fully booted. If the other Active
      NameNode is then restarted, there can be a loss of service if this NameNode has not entered
      Standby.
      """
      import params
    
      # returns [active_namenodes, standby_namenodes, unknown_namenodes]
      namenode_states = namenode_ha_utils.get_namenode_states(params.hdfs_site, params.security_enabled,
        params.hdfs_user, times=5, sleep_time=5, backoff_factor=2)
    
      active_namenodes = namenode_states[0]
      standby_namenodes = namenode_states[1]
    
      # check to see if this is the active NameNode
      if params.namenode_id in active_namenodes:
        return True
    
      # if this is not the active NameNode, then we must wait for it to register as standby
      if params.namenode_id in standby_namenodes:
        return False
    
      # at this point, this NameNode is neither active nor standby - we must wait to ensure it
      # enters at least one of these roles before returning a verdict - the annotation will catch
      # this failure and retry the function automatically
      raise Fail("The NameNode {namenode_id} is not listed as Active or Standby, waiting...".format(
        namenode_id=params.namenode_id))
    ```


- Jonathan Hurley


On Aug. 23, 2016, 12:09 p.m., Jonathan Hurley wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51337/
> -----------------------------------------------------------
> 
> (Updated Aug. 23, 2016, 12:09 p.m.)
> 
> 
> Review request for Ambari, Alejandro Fernandez, Dmytro Grinenko, Jayush 
> Luniya, and Nate Cole.
> 
> 
> Bugs: AMBARI-18240
>     https://issues.apache.org/jira/browse/AMBARI-18240
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> This is caused by an outage of both NameNodes during the downgrade. 
> 
> - We have two NNs at the "Finalize Upgrade" state; 
> -- nn1 is standby (out of safemode)
> -- nn2 is active (out of safemode)
> - A downgrade begins and we restart nn1
> -- After the restart, nn1 hasn't come online yet. Our code tries to 
> contact it and can't, so we move on to nn2.
> -- nn2 is online and active and out of safemode (because it hasn't been 
> downgraded yet), so we let the downgrade continue
> - The downgrade continues and we restart nn2
> -- However, nn1 is still coming online and isn't even standby yet
> 
> Now we have an nn1 which isn't fully loaded and an nn2 which is restarting 
> and trying to figure out whether to be active or standby. It's during this 
> gap that the tests must be failing. 
> 
> So, it seems like we need to be a little bit smarter about waiting for the 
> namenode to restart; we can't just look at the "active" one and say things 
> are OK because it might be the next one to restart.
> 
> 
> Diffs
> -----
> 
>   ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 5a431aa 
> 
> Diff: https://reviews.apache.org/r/51337/diff/
> 
> 
> Testing
> -------
> 
> PENDING
> 
> 
> Thanks,
> 
> Jonathan Hurley
> 
>
