Re: Job Recoverability

Robert Kanter Tue, 06 Aug 2013 16:40:36 -0700

If ActionCheckX is trying to retry, and the JT recovers the job, that
should be fine.  The "retry" is to simply try connecting to the JT to get
the status for the job.  If the user issues a "RESUME" for a START_MANUAL
job, then yes, Oozie will try to resubmit a new job for that action and
we'd have two of them if the JT also recovers it.
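
As a rough illustration of the distinction above, here is a minimal Python sketch, not Oozie code. FakeJobTracker, check_action, resume, and the dictionary used for the action are all hypothetical names invented for this example; only the poll-only retry, the START_MANUAL status, and the resubmit-on-RESUME behavior come from the thread. A real checker would also wait between retries.

```python
class JobTrackerDown(Exception):
    """Raised when the simulated JT cannot be reached."""


class FakeJobTracker:
    """Hypothetical stand-in for the JT; not a Hadoop API."""

    def __init__(self, failed_polls, jobs):
        self.failed_polls = failed_polls  # polls that fail before the JT is back
        self.jobs = dict(jobs)            # job ID -> status for jobs the JT knows
        self.submitted = 0                # how many new jobs were submitted

    def status(self, job_id):
        if self.failed_polls > 0:
            self.failed_polls -= 1
            raise JobTrackerDown(job_id)
        return self.jobs.get(job_id, "UNKNOWN")

    def submit(self):
        self.submitted += 1
        new_id = "job_new_%d" % self.submitted
        self.jobs[new_id] = "RUNNING"
        return new_id


def check_action(action, jt, max_retries=3):
    """Poll the JT for the job status; the retry never resubmits anything."""
    for _ in range(1 + max_retries):
        try:
            return jt.status(action["job_id"])
        except JobTrackerDown:
            continue  # a real checker would sleep here before retrying
    action["status"] = "START_MANUAL"  # gave up; a manual RESUME is now needed
    return None


def resume(action, jt):
    """Unguarded RESUME: always submits a new job for the action."""
    action["job_id"] = jt.submit()
    action["status"] = "RUNNING"
```

In this model, a JT that comes back within the retries leaves a single job, while giving up and then resuming after the JT has recovered the old job leaves two jobs for the same action.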


What if we modified the ActionStartXCommand/ResumeActionXCommand
precondition to check whether the action already has a Job ID that is valid
(i.e. not unknown to the JT) and, if so, fail the precondition check or
something similar?
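
A minimal Python sketch of that idea, under stated assumptions: has_live_job, guarded_resume, the "job_id" key, and the "UNKNOWN" marker are hypothetical names for illustration, not the actual Oozie precondition API. It also assumes the JT is reachable when the check runs; what to do when it is not is left open here.

```python
UNKNOWN = "UNKNOWN"  # assumed marker for "the JT does not know this job ID"


def has_live_job(action, lookup_status):
    """True if the action already has a job ID that the JT still knows."""
    job_id = action.get("job_id")
    return job_id is not None and lookup_status(job_id) != UNKNOWN


def guarded_resume(action, lookup_status, submit):
    """Submit a new job only if the precondition (no live job) holds."""
    if has_live_job(action, lookup_status):
        return False  # precondition failed: keep tracking the existing job
    action["job_id"] = submit()
    action["status"] = "RUNNING"
    return True
```

With a check like this, a RESUME issued after the JT has recovered the old job would be rejected instead of producing a second Hadoop job for the same action.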

- Robert


On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:

> ActionCheckX first retries for a configurable amount of time and then
> sets the status to START_MANUAL.
> So, the problem might happen when the JT recovers the job at the same time
> that 1) ActionCheckX is retrying or 2) the user issues a "RESUME"
> for a START_MANUAL job.
> We have to ensure that this doesn't happen; otherwise we will have two
> Hadoop jobs for the same action.
> The callback happens only when the task is completed, which might be too
> late. By that time, Oozie might have already submitted a new Hadoop
> job for that wf action.
> So it doesn't seem straightforward to prevent Oozie from submitting a new job if
> the JT is already recovering the older one.
>
>
>
> On 8/6/13 4:01 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>
> >Yes, if JT recovers the job, it uses the same ID.  If the JT comes up
> >quickly and recovers the job, Oozie continues working just fine (without
> >the ID swap issues discussed earlier).  When the JT takes longer than the
> >10min ActionCheck interval, and the action is START_MANUAL, that still
> >needs to be figured out.
> >
> >I haven't tested on Hadoop 2.x yet, but I've been told that it should have
> >the same behavior.  The only differences are that the name of the property
> >to enable recoverability on the server (not the job-level one) is different
> >(obviously, because it doesn't have "jobtracker" in it) and that it can also
> >recover the completed tasks, which shouldn't be a problem because the
> >launcher jar has the one task.  I'll of course double check this though.
> >
> >
> >- Robert
> >
> >
> >On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy
> ><rohini.adi...@gmail.com>wrote:
> >
> >> Robert,
> >>     You will not get an unknown hadoop job if the JT has retry configured,
> >> right? What happens in that case? Especially, what happens when the Oozie
> >> retry happens when the JT comes up quickly?  Also, do you know what the
> >> behaviour is with Hadoop 2.x?
> >>
> >> Mayank,
> >>   OOZIE-1231 already has the changes to show Mapreduce job id in the Child
> >> job page to be consistent with other job types. The v1 API has the older
> >> behaviour with map job url in externalId, while v2 API has it in
> >> childjobids.  So there is a UI change but v1 REST API has not changed. But
> >> OOZIE-1231 has not changed any code with respect to id swap.
> >>
> >> Regards,
> >> Rohini
> >>
> >> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter <rkan...@cloudera.com>
> >> wrote:
> >>
> >> > Ya, I saw a precondition failed message.
> >> >
> >> > I just tried out what happens when the job is SUSPENDED, the action is
> >> > START_MANUAL, and the JT recovers the hadoop job: It doesn't continue the
> >> > workflow.  It fails the eagerVerifyPrecondition from
> >> > CompletedActionXCommand because the action isn't RUNNING.  Perhaps we
> >> > should make the CallbackService change the status in this situation?
> >> >
> >> > Just to clarify, the above only happens when the JT has been down long
> >> > enough that the ActionCheckXCommand (every 10min by default) + the retries
> >> > (3 x 1min) happen.  If it comes back sooner than that, everything works
> >> > fine.
> >> >
> >> > thanks
> >> > - Robert
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari <vi...@yahoo-inc.com>
> >> wrote:
> >> >
> >> > > Oh..okay. Seems like RecoveryService queues the StartX command but the
> >> > > verifyPrecondition() fails as the wf job is
> >> > > Suspended (Plz verify this from logs).
> >> > >
> >> > > In that case, if Oozie is not auto-retrying and resubmitting, then it
> >> > > seems fair to have the JT recover the job.
> >> > > But if JT recovers the job, can we make sure that the workflow job
> >> > > transits to RUNNING from SUSPENDED and wf action from START_MANUAL to
> >> > > RUNNING?
> >> > > It should not happen that the user resumes the job which makes Oozie
> >> > > submit a new hadoop job while the JT is also recovering the same job.
> >> > > Also, I think the error can still be considered transient from Oozie
> >> > > perspective as it is temporary depending on state of JT.
> >> > >
> >> > > Thanks,
> >> > > Virag
> >> > >
> >> > >
> >> > > On 8/6/13 1:12 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
> >> > >
> >> > > >Virag,
> >> > > >I just tested out killing the JT and waiting for the Checker service to
> >> > > >retry and give up: the action goes to START_MANUAL and the job gets
> >> > > >SUSPENDED.  I waited around long enough, but the RecoveryService didn't do
> >> > > >anything.  Does it kick in for you?  As a side note, looking at the code,
> >> > > >the RecoveryService looks like it can handle START_MANUAL, END_MANUAL, and
> >> > > >USER_RETRY, which all sound like things the user should be doing; is it
> >> > > >correct that RecoveryService is handling these?
> >> > > >The Unknown Hadoop Job error happens when the JT comes back in time
> >> > > >because it won't know about the old ID if it's not recovering jobs.  So,
> >> > > >Oozie tries to ask it about a job that no longer exists.  I'm not sure that
> >> > > >this should be a transient error because there's no way to determine if
> >> > > >it's because the JT restarted and Oozie should resubmit the job or if
> >> > > >something else happened.
> >> > > >
> >> > > >Mayank,
> >> > > >That is a good point.  We could either make a v3 API or add an oozie-site
> >> > > >config to turn on/off the id swap behavior and keep the v2 API.
> >> > > >
> >> > > >thanks
> >> > > >- Robert
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal <may...@apache.org>
> >> > wrote:
> >> > > >
> >> > > >> Robert,
> >> > > >>
> >> > > >> That's a break in backward compatibility. Until now, users have been
> >> > > >> used to clicking on the link to go to the MR page.
> >> > > >>
> >> > > >> Is there a better way to handle this?
> >> > > >>
> >> > > >> Thanks,
> >> > > >> Mayank
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <
> >> rkan...@cloudera.com>
> >> > > >> wrote:
> >> > > >>
> >> > > >> > Mona,
> >> > > >> > As far as I'm aware, the "retry" that Oozie is doing is just retrying to
> >> > > >> > connect to the JT (which is why when the JT comes back up, Oozie
> >> > > >> > can continue monitoring the hadoop job if it still has the same ID); it
> >> > > >> > doesn't try to submit the job again as part of the "retry".
> >> > > >> >
> >> > > >> > Mayank,
> >> > > >> > We can put the ID for the actual job in the Child IDs tab (like with Pig).
> >> > > >> >
> >> > > >> >
> >> > > >> > - Robert
> >> > > >> >
> >> > > >> >
> >> > > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal
> >><may...@apache.org
> >> >
> >> > > >> wrote:
> >> > > >> >
> >> > > >> > > I agree, we should handle these two scenarios. I am ok with changing
> >> > > >> > > the launcher behavior for MR; however, if we remove the id swap, then
> >> > > >> > > how do we navigate to MR jobs from the UI as we do right now?
> >> > > >> > >
> >> > > >> > > Thanks,
> >> > > >> > > Mayank
> >> > > >> > >
> >> > > >> > >
> >> > > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter
> >> > > >><rkan...@cloudera.com>
> >> > > >> > > wrote:
> >> > > >> > >
> >> > > >> > > > Suppose we leave the MR ID swap thing as is but set the launcher
> >> > > >> > > > recover to 0 and job to 1; then consider these two scenarios:
> >> > > >> > > >
> >> > > >> > > > 1. JT gets restarted during the launcher job but before the launcher
> >> > > >> > > > job actually launches the real job:
> >> > > >> > > >      - The launcher job won't be recovered because we told it not to
> >> > > >> > > >      - The real job was never launched
> >> > > >> > > >      ---> Action never completes and Oozie marks it as failed
> >> > > >> > > >
> >> > > >> > > > 2. Launcher job submits the real job, but JT gets restarted before
> >> > > >> > > > the Oozie server has a chance to swap IDs (it's not an atomic
> >> > > >> > > > operation):
> >> > > >> > > >      - The launcher job won't be recovered because we told it not to
> >> > > >> > > >      - The real job will be recovered and finish successfully
> >> > > >> > > >      ---> Oozie marks the action as failed even though the actual
> >> > > >> > > >      job succeeded because it didn't know about the ID swap
> >> > > >> > > >
> >> > > >> > > > It would only work for the case where the JT gets restarted after
> >> > > >> > > > the ID swap occurs.
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > > - Robert
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <
> >> > may...@apache.org
> >> > > >
> >> > > >> > > wrote:
> >> > > >> > > >
> >> > > >> > > > > Hi Robert,
> >> > > >> > > > >
> >> > > >> > > > > +1 for oozie to set launcher to 1 and 0 to jobs for recovery in all
> >> > > >> > > > > the cases except MR.
> >> > > >> > > > >
> >> > > >> > > > > After the ID is swapped, Oozie only knows about the MR job, doesn't it?
> >> > > >> > > > > Then there should not be any problem.
> >> > > >> > > > >
> >> > > >> > > > > If we set MR launcher recover to 0 and job to 1 then the job will
> >> > > >> > > > > succeed in case of JT restart.
> >> > > >> > > > >
> >> > > >> > > > > Am I missing something?
> >> > > >> > > > >
> >> > > >> > > > > Thanks,
> >> > > >> > > > > Mayank
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <
> >> > > >> rkan...@cloudera.com>
> >> > > >> > > > > wrote:
> >> > > >> > > > >
> >> > > >> > > > > > I think you usually just get the "Unknown Hadoop Job" error message
> >> > > >> > > > > > because Oozie tries to look up the Hadoop Job ID it already has, but
> >> > > >> > > > > > the JT no longer has that ID because it was restarted.  With JT
> >> > > >> > > > > > Recoverability turned on, it will restart the job using the same ID,
> >> > > >> > > > > > so Oozie continues just fine.
> >> > > >> > > > > >
> >> > > >> > > > > > - Robert
> >> > > >> > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
> >> > > >> > > > > > <rohini.adi...@gmail.com>wrote:
> >> > > >> > > > > >
> >> > > >> > > > > > > Wouldn't oozie poll for the job status and decide that it has
> >> > > >> > > > > > > failed and when JT comes up launch another one if retry is
> >> > > >> > > > > > > configured?
> >> > > >> > > > > > >
> >> > > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <
> >> > > >> > > rkan...@cloudera.com>
> >> > > >> > > > > > > wrote:
> >> > > >> > > > > > >
> >> > > >> > > > > > > > Hi,
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > We looked into how to support Job Recoverability (i.e. the JT is
> >> > > >> > > > > > > > restarted and it wants to restart the jobs that were running;
> >> > > >> > > > > > > > similarly for YARN) and have a pretty simple solution for all of
> >> > > >> > > > > > > > the action types except for MapReduce.  If we set
> >> > > >> > > > > > > > mapreduce.job.restart.recover=true for the launcher job and
> >> > > >> > > > > > > > mapreduce.job.restart.recover=false for the jobs launched by the
> >> > > >> > > > > > > > launcher, then when the JT restarts, it will recover the launcher
> >> > > >> > > > > > > > job but not the child jobs -- the launcher job will then take care
> >> > > >> > > > > > > > of relaunching the child jobs.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > For MapReduce, because of the optimization with the id swap, this
> >> > > >> > > > > > > > won't work.  It would be very tricky, if it's even practical, to do
> >> > > >> > > > > > > > something similar for the MR action.  Instead, we think it would be
> >> > > >> > > > > > > > best if we simply remove the MR optimization and make it just like
> >> > > >> > > > > > > > the other action types.  I know we normally don't want to remove
> >> > > >> > > > > > > > optimizations, but there are many advantages in this case, and it's
> >> > > >> > > > > > > > only saving a single Map slot for MR jobs only.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > I've created OOZIE-1483
> >> > > >> > > > > > > > <https://issues.apache.org/jira/browse/OOZIE-1483> with more
> >> > > >> > > > > > > > details and should have a patch soon.
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > Thoughts?
> >> > > >> > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > thanks
> >> > > >> > > > > > > > - Robert
> >> > > >> > > > > > > >
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > >
> >> > >
> >> >
> >>
>
>
