Adding to Mona's comments: if the job fails with a transient error (unable to connect to the JT, or something similar), then Oozie will retry a configurable number of times and will then set the job status to START_MANUAL. The recovery service will then try to execute jobs in this status and will succeed once the Hadoop processes are back online. If we have JT recoverability, then both Oozie and the JT will retry the same job, with different IDs.
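To make that retry/START_MANUAL flow concrete, here is a minimal, illustrative Java sketch. It is not the actual Oozie code path; the class name, the MAX_RETRIES constant, and the simplified status handling are stand-ins for the configurable retry count and the real workflow action states.

    // Illustrative only: models "retry a configurable number of times,
    // then park the action in START_MANUAL for the recovery service".
    public final class TransientRetrySketch {

        enum Status { START_RETRY, START_MANUAL }

        static final int MAX_RETRIES = 3; // stands in for the configurable retry count

        static Status onTransientFailure(int attemptsSoFar) {
            if (attemptsSoFar < MAX_RETRIES) {
                return Status.START_RETRY;   // Oozie keeps retrying on its own
            }
            return Status.START_MANUAL;      // recovery service re-attempts it later
        }

        public static void main(String[] args) {
            for (int attempt = 1; attempt <= 4; attempt++) {
                System.out.println("attempt " + attempt + " -> " + onTransientFailure(attempt));
            }
        }
    }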
I think Oozie might not be registering the error as a transient one when it sees an 'Unknown Hadoop Job' (you see this with YARN, right?). Can you check whether making that error register with Oozie as transient fixes the issue?

Thanks,
Virag

On 8/6/13 10:26 AM, "Mona Chitnis" <chit...@yahoo-inc.com> wrote:

>Adding to Robert's comment,
>
>Oozie retry currently does not take into account if the JT is down or
>undergoing restart etc. It retries (up to the user-configurable max) in
>quick succession and then will give up. If the JT is expected to be down
>longer than average (retry interval x retry times), then recovering on the
>JT side will be an advantage. However, in the case of a transient error
>and not a larger maintenance window, wouldn't both Oozie and the JT end up
>retrying the same job?
>
>
>On 8/6/13 9:59 AM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>
>>I think you usually just get the "Unknown Hadoop Job" error message
>>because Oozie tries to look up the Hadoop Job ID it already has, but the
>>JT no longer has that ID because it was restarted. With JT Recoverability
>>turned on, it will restart the job using the same ID, so Oozie continues
>>just fine.
>>
>>- Robert
>>
>>
>>On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
>><rohini.adi...@gmail.com> wrote:
>>
>>> Wouldn't Oozie poll for the job status, decide that it has failed, and
>>> when the JT comes up launch another one if retry is configured?
>>>
>>> On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <rkan...@cloudera.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > We looked into how to support Job Recoverability (i.e. the JT is
>>> > restarted and it wants to restart the jobs that were running;
>>> > similarly for YARN) and have a pretty simple solution for all of the
>>> > action types except for MapReduce. If we set
>>> > mapreduce.job.restart.recover=true for the launcher job and
>>> > mapreduce.job.restart.recover=false for the jobs launched by the
>>> > launcher, then when the JT restarts, it will recover the launcher job
>>> > but not the child jobs -- the launcher job will then take care of
>>> > relaunching the child jobs.
>>> >
>>> > For MapReduce, because of the optimization with the id swap, this
>>> > won't work. It would be very tricky, if it's even practical, to do
>>> > something similar for the MR action. Instead, we think it would be
>>> > best if we simply remove the MR optimization and make it just like
>>> > the other action types. I know we normally don't want to remove
>>> > optimizations, but in this case there are many advantages, and the
>>> > optimization only saves a single Map slot, and only for MR jobs.
>>> >
>>> > I've created OOZIE-1483
>>> > <https://issues.apache.org/jira/browse/OOZIE-1483> with more details
>>> > and should have a patch soon.
>>> >
>>> > Thoughts?
>>> >
>>> >
>>> > thanks
>>> > - Robert
>>> >
>>>
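For reference, a minimal Java sketch of the launcher-vs-child settings Robert describes above, using the mapreduce.job.restart.recover property named in the thread. The use of Hadoop's Configuration class and the "oozie.launcher." prefix to target the launcher job are assumptions about how this would typically be wired, not a quote from the OOZIE-1483 patch.

    import org.apache.hadoop.conf.Configuration;

    public final class RecoverFlagsSketch {
        public static void main(String[] args) {
            Configuration actionConf = new Configuration(false);
            // Assumed convention: "oozie.launcher."-prefixed properties are applied
            // to the launcher job, so the JT recovers the launcher on restart...
            actionConf.set("oozie.launcher.mapreduce.job.restart.recover", "true");
            // ...while the child job is not recovered by the JT; the recovered
            // launcher re-submits it instead.
            actionConf.set("mapreduce.job.restart.recover", "false");

            System.out.println("launcher recover = "
                    + actionConf.get("oozie.launcher.mapreduce.job.restart.recover"));
            System.out.println("child recover    = "
                    + actionConf.get("mapreduce.job.restart.recover"));
        }
    }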