Jonathan Hurley created AMBARI-18240:
----------------------------------------
Summary: During a Rolling Downgrade Oozie Long Running Jobs Can Fail
Key: AMBARI-18240
URL: https://issues.apache.org/jira/browse/AMBARI-18240
Project: Ambari
Issue Type: Bug
Components: ambari-server
Affects Versions: 2.4.0
Reporter: Jonathan Hurley
Assignee: Jonathan Hurley
Priority: Blocker
Fix For: trunk
- Install HDP-2.3.2.0-2950 with Ambari 2.4.0
- Begin a long-running job (LRJ) in Oozie
- Start upgrading to HDP-2.5.0.0-1235
- Before the finalize step, start downgrading to HDP-2.3.2.0-2950.
Sometimes, the LRJ will fail:
{code}
/usr/hdp/current/oozie-client/bin/oozie job -oozie http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie -info 0000001-160821214718970-oozie-oozi-C@248
ID : 0000001-160821214718970-oozie-oozi-C@248
------------------------------------------------------------------------------------------------------------------------------------
Action Number : 248
Console URL : -
Error Code : -
Error Message : -
External ID : 0000030-160822042035608-oozie-oozi-W
External Status : -
Job ID : 0000001-160821214718970-oozie-oozi-C
Tracker URI : -
Created : 2016-08-22 00:37 GMT
Nominal Time : 2009-01-01 21:35 GMT
Status : FAILED
Last Modified : 2016-08-22 05:15 GMT
First Missing Dependency : -
------------------------------------------------------------------------------------------------------------------------------------
[hrt_qa@natr66-grls-dlm10toeriedwngdsec-r6-21 ~]$ /usr/hdp/current/oozie-client/bin/oozie job -oozie http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie -info 0000030-160822042035608-oozie-oozi-W
Job ID : 0000030-160822042035608-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : wordcount
App Path : hdfs://nameservice/user/hrt_qa/test_oozie_long_running
Status : FAILED
Run : 0
User : hrt_qa
Group : -
Created : 2016-08-22 05:08 GMT
Started : 2016-08-22 05:08 GMT
Last Modified : 2016-08-22 05:15 GMT
Ended : 2016-08-22 05:15 GMT
CoordAction ID: 0000001-160821214718970-oozie-oozi-C@248
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status  Ext ID                  Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000030-160822042035608-oozie-oozi-W@wc       FAILED  job_1471842441396_0002  FAILED      JA017
------------------------------------------------------------------------------------------------------------------------------------
0000030-160822042035608-oozie-oozi-W@:start:  OK      -                       OK          -
------------------------------------------------------------------------------------------------------------------------------------
{code}
This is caused by an outage of both NameNodes during the downgrade.
- We have two NNs at the "Finalize Upgrade" state;
-- nn1 is standby (out of safemode)
-- nn2 is active (out of safemode)
- A downgrade begins and we restart nn1
-- After the restart, nn1 hasn't come online yet. Our code tries to contact it
and can't, so we move on to nn2.
-- nn2 is online and active and out of safemode (because it hasn't been
downgraded yet), so we let the downgrade continue
- The downgrade continues and we restart nn2
-- However, nn1 is still coming online and isn't even standby yet
Now we have an nn1 which isn't fully loaded and an nn2 which is restarting and
trying to figure out whether to be active or standby, so neither NameNode is
available. It's during this gap that the tests must be failing.
So, it seems like we need to be smarter about waiting for a NameNode to come
back after its restart; we can't just look at the "active" one and say things
are OK, because it might be the next one to restart.
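A minimal sketch of the kind of wait described above, in Python. This is not Ambari's actual code: `wait_for_namenode`, `restart_namenodes`, and the `restart` and `get_state` callables are hypothetical names used only for illustration. The assumption is that `get_state` returns "active", "standby", or None when the NameNode can't be reached (one way to get this would be `hdfs haadmin -getServiceState <serviceId>`). The point is that each restarted NameNode must itself report active or standby before the next one is restarted, instead of accepting that the other NameNode is currently active.

```python
import time

# States in which a NameNode is considered to be back after a restart.
READY_STATES = {"active", "standby"}


def wait_for_namenode(nn_id, get_state, timeout=600.0, interval=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Block until the given NameNode reports active or standby.

    get_state is a hypothetical probe: it takes a NameNode id and returns
    its HA state as a string, or None if the NameNode cannot be contacted.
    Raises RuntimeError if the NameNode is not ready before the timeout.
    """
    deadline = clock() + timeout
    while True:
        state = get_state(nn_id)
        if state in READY_STATES:
            return state
        if clock() >= deadline:
            raise RuntimeError(
                "NameNode %s not active/standby after %s seconds (last state: %r)"
                % (nn_id, timeout, state))
        sleep(interval)


def restart_namenodes(nn_ids, restart, get_state, **wait_kwargs):
    """Restart NameNodes one at a time.

    The next restart only begins once the NameNode that was just restarted
    is active or standby again. The state of the *other* NameNode is
    deliberately not treated as good enough.
    """
    for nn_id in nn_ids:
        restart(nn_id)
        wait_for_namenode(nn_id, get_state, **wait_kwargs)
```

With this shape, the sequence above would stall after restarting nn1 until nn1 is at least standby, so nn2 is never taken down while nn1 is still loading. The timeout and interval defaults are arbitrary placeholders.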