Re: Review Request 51337: During a Rolling Downgrade Oozie Long Running Jobs Can Fail

Jonathan Hurley Tue, 23 Aug 2016 09:18:10 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51337/
-----------------------------------------------------------


(Updated Aug. 23, 2016, 12:17 p.m.)


Review request for Ambari, Alejandro Fernandez, Dmytro Grinenko, Jayush Luniya, 
and Nate Cole.


Bugs: AMBARI-18240
    https://issues.apache.org/jira/browse/AMBARI-18240


Repository: ambari


Description
-------

This is caused by an outage of both NameNodes during the downgrade. 

- We have two NNs at the "Finalize Upgrade" state; 
-- nn1 is standby (out of safemode)
-- nn2 is active (out of safemode)
- A downgrade begins and we restart nn1
-- After the restart, nn1 hasn't come online yet. Our code tries to 
contact it and can't, so we move on to nn2.
-- nn2 is online and active and out of safemode (because it hasn't been 
downgraded yet), so we let the downgrade continue
- The downgrade continues and we restart nn2
-- However, nn1 is still coming online and isn't even standby yet

Now we have an nn1 which isn't fully loaded and an nn2 which is restarting and 
trying to figure out whether to be active or standby. It's during this gap that 
the tests must be failing. 

So, it seems like we need to be a little bit smarter about waiting for the 
namenode to restart; we can't just look at the "active" one and say things are 
OK because it might be the next one to restart.
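
As a rough illustration of the idea above (not the actual change in hdfs_namenode.py), the wait could poll the NameNode that was just restarted until it reports an HA state, instead of accepting whichever NameNode happens to be active. The names wait_for_restarted_namenode, get_ha_state and NameNodeNotReadyError are hypothetical; get_ha_state stands in for whatever the script uses to query a NameNode (for example, something built on "hdfs haadmin -getServiceState"), and the retry count and delay are placeholder values.

```python
import time


class NameNodeNotReadyError(Exception):
    """Raised when the restarted NameNode never reports an HA state."""


def wait_for_restarted_namenode(get_ha_state, restarted_id,
                                retries=30, delay=10, sleep=time.sleep):
    """Block until the NameNode we just restarted is active or standby.

    get_ha_state is a hypothetical callable: it takes a NameNode id and
    returns "active", "standby", or None if the NameNode is unreachable.
    Only the restarted NameNode is consulted; a healthy peer does not
    count, because the peer may be the next one to restart.
    """
    for attempt in range(retries):
        state = get_ha_state(restarted_id)
        if state in ("active", "standby"):
            return state
        # Not reachable or still initializing: wait, then ask again.
        if attempt < retries - 1:
            sleep(delay)
    raise NameNodeNotReadyError(
        "%s did not become active or standby after %d checks"
        % (restarted_id, retries))
```

With a check along these lines, the restart of nn2 in the scenario above would not begin until nn1 had at least reached standby, which should close the window where neither NameNode is available.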


Diffs (updated)
-----

  
  ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py 5a431aa 

Diff: https://reviews.apache.org/r/51337/diff/


Testing
-------

PENDING


Thanks,

Jonathan Hurley
