[
https://issues.apache.org/jira/browse/AMBARI-18240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433819#comment-15433819
]
Hudson commented on AMBARI-18240:
---------------------------------
SUCCESS: Integrated in Jenkins build Ambari-trunk-Commit #5578 (See
[https://builds.apache.org/job/Ambari-trunk-Commit/5578/])
AMBARI-18240 - During a Rolling Downgrade Oozie Long Running Jobs Can (jhurley:
[http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=04a534ceacb1887c4666c97ea0d1a2670fe4a1cd])
* (edit) ambari-server/src/test/python/stacks/2.0.6/HDFS/test_namenode.py
* (edit) ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py
> During a Rolling Downgrade Oozie Long Running Jobs Can Fail
> -----------------------------------------------------------
>
> Key: AMBARI-18240
> URL: https://issues.apache.org/jira/browse/AMBARI-18240
> Project: Ambari
> Issue Type: Bug
> Components: ambari-server
> Affects Versions: 2.4.0
> Reporter: Jonathan Hurley
> Assignee: Jonathan Hurley
> Priority: Blocker
> Fix For: trunk
>
> Attachments: AMBARI-18240.patch
>
>
> - Install HDP-2.3.2.0-2950 with Ambari 2.4.0
> - Begin a long-running job (LRJ) in Oozie
> - Start upgrading to HDP-2.5.0.0-1235
> - Before the finalize step, start downgrading to HDP-2.3.2.0-2950.
> Sometimes, the LRJ will fail:
> {code}
> /usr/hdp/current/oozie-client/bin/oozie job -oozie
> http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie
> -info 0000001-160821214718970-oozie-oozi-C@248
> ID : 0000001-160821214718970-oozie-oozi-C@248
> ------------------------------------------------------------------------------------------------------------------------------------
> Action Number : 248
> Console URL : -
> Error Code : -
> Error Message : -
> External ID : 0000030-160822042035608-oozie-oozi-W
> External Status : -
> Job ID : 0000001-160821214718970-oozie-oozi-C
> Tracker URI : -
> Created : 2016-08-22 00:37 GMT
> Nominal Time : 2009-01-01 21:35 GMT
> Status : FAILED
> Last Modified : 2016-08-22 05:15 GMT
> First Missing Dependency : -
> ------------------------------------------------------------------------------------------------------------------------------------
> [hrt_qa@natr66-grls-dlm10toeriedwngdsec-r6-21 ~]$
> /usr/hdp/current/oozie-client/bin/oozie job -oozie
> http://natr66-grls-dlm10toeriedwngdsec-r6-10.openstacklocal:11000/oozie
> -info 0000030-160822042035608-oozie-oozi-W
> Job ID : 0000030-160822042035608-oozie-oozi-W
> ------------------------------------------------------------------------------------------------------------------------------------
> Workflow Name : wordcount
> App Path : hdfs://nameservice/user/hrt_qa/test_oozie_long_running
> Status : FAILED
> Run : 0
> User : hrt_qa
> Group : -
> Created : 2016-08-22 05:08 GMT
> Started : 2016-08-22 05:08 GMT
> Last Modified : 2016-08-22 05:15 GMT
> Ended : 2016-08-22 05:15 GMT
> CoordAction ID: 0000001-160821214718970-oozie-oozi-C@248
> Actions
> ------------------------------------------------------------------------------------------------------------------------------------
> ID
> Status Ext ID Ext Status Err Code
> ------------------------------------------------------------------------------------------------------------------------------------
> 0000030-160822042035608-oozie-oozi-W@wc
> FAILED job_1471842441396_0002 FAILED JA017
> ------------------------------------------------------------------------------------------------------------------------------------
> 0000030-160822042035608-oozie-oozi-W@:start:
> OK - OK -
> ------------------------------------------------------------------------------------------------------------------------------------
> {code}
> This is caused by an outage of both NameNodes during the downgrade.
> - We have two NNs at the "Finalize Upgrade" state;
> -- nn1 is standby (out of safemode)
> -- nn2 is active (out of safemode)
> - A downgrade begins and we restart nn1
> -- After the restart, nn1 hasn't come online yet. Our code tries to
> contact it and can't, so we move on to nn2.
> -- nn2 is online and active and out of safemode (because it hasn't been
> downgraded yet), so we let the downgrade continue
> - The downgrade continues and we restart nn2
> -- However, nn1 is still coming online and isn't even standby yet
> Now nn1 isn't fully loaded and nn2 is restarting and trying to figure out
> whether to be active or standby. With neither NameNode serving requests,
> this gap is presumably when the job fails.
> So it seems we need to be smarter about waiting for the restarted NameNode
> to come back; we can't just find an "active" NameNode and say things are
> OK, because it might be the next one to restart.
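
The waiting logic described above could be sketched roughly as follows. This is
only an illustration and not the contents of AMBARI-18240.patch: the function
name, its parameters, and the get_ha_state callable are all hypothetical. In a
real script the probe might wrap something like
"hdfs haadmin -getServiceState <id>", but that is an assumption here. The point
is that the check polls the NameNode that was just restarted and never accepts
the state of the other NameNode as a substitute.

```python
import time

# HA states that mean the restarted NameNode has finished coming up.
READY_STATES = ("active", "standby")


def wait_for_restarted_namenode(get_ha_state, namenode_id, attempts=30,
                                delay_seconds=10, sleep=time.sleep):
    """Block until the restarted NameNode reports an HA state.

    get_ha_state is a hypothetical callable supplied by the caller. It takes
    a NameNode id and returns "active", "standby", or None when the NameNode
    cannot be reached yet.
    """
    for attempt in range(attempts):
        state = get_ha_state(namenode_id)
        if state in READY_STATES:
            return state
        # Only sleep if another attempt will follow.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    raise RuntimeError(
        "NameNode %s did not become active or standby after %d attempts"
        % (namenode_id, attempts))
```

Called right after restarting nn1, a check like this would hold the downgrade
until nn1 is at least standby, so nn2 is not restarted while nn1 is still
loading.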
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)