Adding to Mona's comments: if the job fails with a transient error (unable to connect to the JT, or something similar), then Oozie will retry a configurable number of times and will then set the job status to START_MANUAL. The recovery service will then try to execute jobs in this status and will succeed once the Hadoop processes are back online. If we have JT recoverability, then both Oozie and the JT will retry the same job, with different IDs.
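To make that retry/START_MANUAL flow concrete, here is a minimal, illustrative Java sketch. It is not the actual Oozie code path; the class name, the MAX_RETRIES constant, and the simplified status handling are stand-ins for the configurable retry count and the real workflow action states.

    // Illustrative only: models "retry a configurable number of times,
    // then park the action in START_MANUAL for the recovery service".
    public final class TransientRetrySketch {

        enum Status { START_RETRY, START_MANUAL }

        static final int MAX_RETRIES = 3; // stands in for the configurable retry count

        static Status onTransientFailure(int attemptsSoFar) {
            if (attemptsSoFar < MAX_RETRIES) {
                return Status.START_RETRY;   // Oozie keeps retrying on its own
            }
            return Status.START_MANUAL;      // recovery service re-attempts it later
        }

        public static void main(String[] args) {
            for (int attempt = 1; attempt <= 4; attempt++) {
                System.out.println("attempt " + attempt + " -> " + onTransientFailure(attempt));
            }
        }
    }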
I think Oozie might not be registering the error as a transient one when it sees an 'Unknown Hadoop Job' (you see this with YARN, right?). Can you check whether making that error register with Oozie as transient fixes the issue?

Thanks,
Virag

On 8/6/13 10:26 AM, "Mona Chitnis" <chit...@yahoo-inc.com> wrote:

>Adding to Robert's comment,
>
>Oozie retry currently does not take into account if the JT is down or
>undergoing restart etc. It retries (up to the user-configurable max) in
>quick succession and then will give up. If the JT is expected to be down
>longer than average (retry interval x retry times), then recovering on the
>JT side will be an advantage. However, in the case of a transient error
>and not a larger maintenance window, wouldn't both Oozie and the JT end up
>retrying the same job?
>
>
>On 8/6/13 9:59 AM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>
>>I think you usually just get the "Unknown Hadoop Job" error message
>>because Oozie tries to look up the Hadoop Job ID it already has, but the
>>JT no longer has that ID because it was restarted. With JT Recoverability
>>turned on, it will restart the job using the same ID, so Oozie continues
>>just fine.
>>
>>- Robert
>>
>>
>>On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
>><rohini.adi...@gmail.com> wrote:
>>
>>> Wouldn't Oozie poll for the job status, decide that it has failed, and
>>> when the JT comes up launch another one if retry is configured?
>>>
>>> On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <rkan...@cloudera.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We looked into how to support Job Recoverability (i.e. the JT is
>>> > restarted and it wants to restart the jobs that were running;
>>> > similarly for YARN) and have a pretty simple solution for all of the
>>> > action types except for MapReduce. If we set
>>> > mapreduce.job.restart.recover=true for the launcher job and
>>> > mapreduce.job.restart.recover=false for the jobs launched by the
>>> > launcher, then when the JT restarts, it will recover the launcher job
>>> > but not the child jobs -- the launcher job will then take care of
>>> > relaunching the child jobs.
>>> >
>>> > For MapReduce, because of the optimization with the id swap, this
>>> > won't work. It would be very tricky, if it's even practical, to do
>>> > something similar for the MR action. Instead, we think it would be
>>> > best if we simply remove the MR optimization and make it just like
>>> > the other action types. I know we normally don't want to remove
>>> > optimizations, but in this case there are many advantages, and the
>>> > optimization only saves a single Map slot, and only for MR jobs.
>>> >
>>> > I've created OOZIE-1483
>>> > <https://issues.apache.org/jira/browse/OOZIE-1483> with more details
>>> > and should have a patch soon.
>>> >
>>> > Thoughts?
>>> >
>>> >
>>> > thanks
>>> > - Robert
>>> >
>>>
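For reference, a minimal Java sketch of the launcher-vs-child settings Robert describes above, using the mapreduce.job.restart.recover property named in the thread. The use of Hadoop's Configuration class and the "oozie.launcher." prefix to target the launcher job are assumptions about how this would typically be wired, not a quote from the OOZIE-1483 patch.

    import org.apache.hadoop.conf.Configuration;

    public final class RecoverFlagsSketch {
        public static void main(String[] args) {
            Configuration actionConf = new Configuration(false);
            // Assumed convention: "oozie.launcher."-prefixed properties are applied
            // to the launcher job, so the JT recovers the launcher on restart...
            actionConf.set("oozie.launcher.mapreduce.job.restart.recover", "true");
            // ...while the child job is not recovered by the JT; the recovered
            // launcher re-submits it instead.
            actionConf.set("mapreduce.job.restart.recover", "false");

            System.out.println("launcher recover = "
                    + actionConf.get("oozie.launcher.mapreduce.job.restart.recover"));
            System.out.println("child recover    = "
                    + actionConf.get("mapreduce.job.restart.recover"));
        }
    }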