Re: Job Recoverability

Robert Kanter Mon, 12 Aug 2013 16:19:57 -0700

There's also a bug that needs to be fixed:
YARN-1058<https://issues.apache.org/jira/browse/YARN-1058>.
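
To make the START_MANUAL race discussed further down this thread concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not Oozie code: `Action`, `Cluster`, `handle_non_transient`, `resume`, and `simulate` are hypothetical names. It models only what the thread describes: a recovering JT/RM keeps the original job ID, and a RESUME on a START_MANUAL action resubmits the job.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """Hypothetical stand-in for a workflow action and its external job ID."""
    status: str = "RUNNING"
    external_id: str = "job_0001"


@dataclass
class Cluster:
    """Hypothetical stand-in for a JT/RM; recovered jobs keep their original ID."""
    recovery_enabled: bool = True
    jobs: list = field(default_factory=list)
    next_id: int = 2

    def restart(self):
        # With recovery on, running jobs survive the restart; otherwise they are lost.
        if not self.recovery_enabled:
            self.jobs = []

    def submit(self):
        job_id = f"job_{self.next_id:04d}"
        self.next_id += 1
        self.jobs.append(job_id)
        return job_id


def handle_non_transient(action, keep_running):
    # keep_running=False models the current behavior (dial back to START_MANUAL);
    # keep_running=True models the proposed fix (leave the action RUNNING).
    if not keep_running:
        action.status = "START_MANUAL"


def resume(action, cluster):
    if action.status == "START_MANUAL":
        # A START_MANUAL action is started again, i.e. a new job is submitted.
        action.external_id = cluster.submit()
        action.status = "RUNNING"
    elif action.external_id not in cluster.jobs:
        # Action checks resume, but the cluster no longer knows the job.
        action.status = "FAILED"


def simulate(keep_running, recovery_enabled):
    cluster = Cluster(recovery_enabled=recovery_enabled, jobs=["job_0001"])
    action = Action()
    cluster.restart()
    handle_non_transient(action, keep_running)
    resume(action, cluster)
    return action.status, len(cluster.jobs)
```

Under these assumptions, `simulate(keep_running=False, recovery_enabled=True)` ends with two Hadoop jobs for one action, while leaving the action in RUNNING ends with one. With recovery disabled, the RUNNING variant fails the action instead of resubmitting it.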




On Mon, Aug 12, 2013 at 4:05 PM, Robert Kanter <rkan...@cloudera.com> wrote:

> Currently, job recoverability (at least for what Oozie needs) isn't quite
> there yet in YARN.  We're working on improving it in 
> YARN-1055<https://issues.apache.org/jira/browse/YARN-1055>.
>
>
> - Robert
>
>
> On Mon, Aug 12, 2013 at 2:06 PM, Rohini Palaniswamy <
> rohini.adi...@gmail.com> wrote:
>
>> Tucu,
>>     Any idea on what is the status of job recoverability with YARN? Is it
>> part of the 2.1 release? At least I know that we don't have it supported in our
>> clusters yet. I can check with our hadoop team if not.
>>
>> Regards,
>> Rohini
>>
>> On Thu, Aug 8, 2013 at 1:30 PM, Alejandro Abdelnur <t...@cloudera.com
>> >wrote:
>>
>> > the change mentioned in 1) is a bug, a nasty one. This is a problem
>> with JT
>> > recovery turned ON or OFF and with any version of Hadoop.
>> >
>> > It has to be fixed.
>> >
>> > Also, Hadoop 1 JT job recovery is stable and works as expected.
>> >
>> > Thanks.
>> >
>> >
>> > On Thu, Aug 8, 2013 at 10:56 AM, Rohini Palaniswamy <
>> > rohini.adi...@gmail.com
>> > > wrote:
>> >
>> > > Haven't gone through the whole thread in detail yet. But looking at
>> the
>> > > change mentioned in 1), the first thing that comes to my mind is that
>> it
>> > > might not work as expected if job recoverability is not turned on. We
>> > need
>> > > to consider that case. We cannot expect everyone to be in the latest
>> > > version of hadoop and have recoverability turned on. Job
>> recoverability
>> > in
>> > > hadoop is not fully mature yet and not tested well.
>> > >
>> > > On Thu, Aug 8, 2013 at 10:17 AM, Robert Kanter <rkan...@cloudera.com>
>> > > wrote:
>> > >
>> > > > So, does this sound good?
>> > > >
>> > > > 1) Create a JIRA to make the ActionCheckXCommand leave the action RUNNING
>> > > > instead of START_MANUAL and ResumeXCommand shouldn't resubmit the job
>> > > > 2) OOZIE-1483 to remove the MR optimization and set the launcher job to
>> > > > recover but not the real job
>> > > >
>> > > > The property to set a job to not recover wasn't added until Hadoop 1.2.0
>> > > > and we're using 1.1.1, so we'll also need:
>> > > > 3) Create a JIRA to bump up the Hadoop version to 1.2.x
>> > > >
>> > > > There's also a problem with the DistCp action where DistCp doesn't actually
>> > > > read the jobconf that Oozie prepares, and recoverability is enabled by
>> > > > default on all jobs, so we can't disable it for the DistCp action until
>> > > > DistCp is updated accordingly and we switch to a Hadoop release with that
>> > > > fix, so we'll also need:
>> > > > 4) A MAPREDUCE JIRA to make DistCp accept a jobconf
>> > > > In the meantime, this will have to be a known issue where if the JT is
>> > > > restarted with recoverability, you'll end up with two hadoop jobs running
>> > > > DistCp
>> > > >
>> > > > And what should we do about the external id being the launcher job instead
>> > > > of the real job after removing the MR optimization?
>> > > >
>> > > >
>> > > > thanks
>> > > > - Robert
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Aug 7, 2013 at 8:45 PM, Virag Kothari <vi...@yahoo-inc.com>
>> > > wrote:
>> > > >
>> > > > > Ahh, I forgot about OOZIE-994. My bad, I suggested that change.
>> > > > Everything
>> > > > > makes sense now. Thanks!
>> > > > >
>> > > > > On 8/7/13 7:38 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>> > > > >
>> > > > > >The behavior where the ActionCheckXCommand calls
>> > handleNonTransient()
>> > > > with
>> > > > > >START_MANUAL when the JT can't be reached after the retries and
>> on
>> > > > RESUME
>> > > > > >command will resubmit the job was something I did for OOZIE-994.
>>  In
>> > > > > >hindsight, we shouldn't have done it that way.
>> > > > > >
>> > > > > >Yes, it will fail if job recovery is not enabled in the JT/RM;
>> but I
>> > > > think
>> > > > > >this is the more correct behavior as this is something that the
>> > > external
>> > > > > >system should be taking care of.
>> > > > > >
>> > > > > >- Robert
>> > > > > >
>> > > > > >
>> > > > > >On Wed, Aug 7, 2013 at 5:05 PM, Virag Kothari <
>> vi...@yahoo-inc.com>
>> > > > > wrote:
>> > > > > >
>> > > > > >> Alejandro, I agree that functionality would be preserved if
>> action
>> > > is
>> > > > > >>left
>> > > > > >> in RUNNING during a transient error.
>> > > > > >>
>> > > > > >> A few questions:
>> > > > > >>
>> > > > > >> 1) START_MANUAL seems to be set only by handleNonTransient().
>> If
>> > > this
>> > > > > >>is a
>> > > > > >> bug, do you know for what purpose it was introduced?
>> > > > > >>    I thought having START_MANUAL is a way to distinguish
>> between
>> > > Oozie
>> > > > > >> suspending job due to transient error and a user manually
>> > suspending
>> > > > the
>> > > > > >> job.
>> > > > > >>
>> > > > > >> 2) With no oozie retry on 'RESUME', jobs will fail if JT/RM
>> > recovery
>> > > > is
>> > > > > >> not enabled. And it seems that YARN recovery is still not
>> there as
>> > > > > >> YARN-128 is not yet committed (Not sure if looking at right
>> JIRA).
>> > > > > >>   It's a concern for us as we ask users to RESUME their jobs
>> after
>> > > > hadoop
>> > > > > >> upgrade. Now they have to resume wf and rerun the failed
>> actions.
>> > > > > >>
>> > > > > >> Thanks,
>> > > > > >> Virag
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >> On 8/7/13 2:48 PM, "Alejandro Abdelnur" <t...@cloudera.com>
>> > wrote:
>> > > > > >>
>> > > > > >> >[joining the party a bit late]
>> > > > > >> >
>> > > > > >> >I just had an offline call with RobertK who brought me up to speed.
>> > > > > >> >
>> > > > > >> >By design, Oozie will retry starting a workflow action ONLY if it couldn't
>> > > > > >> >start the WF action before. If Oozie started the WF action successfully,
>> > > > > >> >the WF action state goes into RUNNING, and from then on it is the
>> > > > > >> >responsibility of the external system running the action to recover it.
>> > > > > >> >Oozie will not attempt any recovery after that point.
>> > > > > >> >
>> > > > > >> >This means that with Hadoop (JT or YARN) job recovery, the launcher job
>> > > > > >> >will be recovered by Hadoop without any intervention from Oozie.
>> > > > > >> >
>> > > > > >> >It is clear that to have recovery for the MR action we need to get rid of
>> > > > > >> >the swap and just hold onto the MR launcher job as we do for the other
>> > > > > >> >actions.
>> > > > > >> >
>> > > > > >> >Now, on the whole discussion of the ActionCheckXCommand retries: we have a
>> > > > > >> >bug in the ActionCheckXCommand. In handleNonTransient() we should not
>> > > > > >> >change the status of the WF action to START_MANUAL; we should leave it in
>> > > > > >> >RUNNING. handleNonTransient() will suspend the WF job, thus switching off
>> > > > > >> >action checks. On WF job resume, the action checks will start working
>> > > > > >> >again, and if Hadoop has job recovery, things will work fine. Else the WF
>> > > > > >> >action will fail because the launcher job is not known (the external system
>> > > > > >> >does not know how to recover jobs). Because we are resetting the status to
>> > > > > >> >START_MANUAL we are dialing back the lifecycle of the action; that is
>> > > > > >> >incorrect and it creates the race condition that introduces 2 jobs.
>> > > > > >> >
>> > > > > >> >So again, Oozie is not responsible for recovering actions. With that
>> > > > > >> >assumption, fixing handleNonTransient() to leave the status in RUNNING
>> > > > > >> >and getting rid of the RM swap logic, we should be good.
>> > > > > >> >
>> > > > > >> >Thoughts?
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >On Wed, Aug 7, 2013 at 12:27 AM, Virag Kothari <
>> > > vi...@yahoo-inc.com>
>> > > > > >> >wrote:
>> > > > > >> >
>> > > > > >> >> Robert,
>> > > > > >> >>
>> > > > > >> >> I have been thinking on this for a while and have few more
>> > > concerns
>> > > > > >>if
>> > > > > >> >>the
>> > > > > >> >> job retries are not streamlined through Oozie.
>> > > > > >> >>
>> > > > > >> >> 1) Till the JT finishes recovering the job, the wf job/wf
>> > action
>> > > > > >>status
>> > > > > >> >> will be SUSPENDED/START_MANUAL.
>> > > > > >> >> Isn't it misleading as the hadoop job is RUNNING while oozie
>> > > > > >>incorrectly
>> > > > > >> >> shows as SUSPENDED? Even if allow this, after the job
>> > completes,
>> > > > > >>what if
>> > > > > >> >> the callback is lost or oozie is down?
>> > > > > >> >> To prevent the job being in SUSPENDED forever, we need to
>> hack
>> > > our
>> > > > > >> >> services to pull SUSPENDED/START_MANUAL jobs from db and
>> update
>> > > > their
>> > > > > >> >> status.
>> > > > > >> >>
>> > > > > >> >> 2) Should we allow failing of the user RESUME command if the
>> > > action
>> > > > > >>is
>> > > > > >> >>in
>> > > > > >> >> START_MANUAL to prevent the race condition we were
>> discussing?
>> > > > > >> >> This would mean changing the semantics of the states.
>> > > > > >> >>
>> > > > > >> >> 3) Confused on mapred.job.restart.recover. Reading
>> > > > > >> >>
>> http://archive.cloudera.com/cdh4/cdh/4/mr1/mapred-default.html
>> > ,
>> > > it
>> > > > > >>says
>> > > > > >> >> that the default value of this is true. So,
>> > > > > >> >> if mapred.jobtracker.restart.recover (system config) is
>> already
>> > > > > >>enabled,
>> > > > > >> >> is job recovery on by default? Also, does recover mean the
>> job
>> > > will
>> > > > > >> >>start
>> > > > > >> >> where it left from or is it just plain restart?
>> > > > > >> >>
>> > > > > >> >> In summary, IMO allowing hadoop to recover jobs
>> independently
>> > > > > >>bypassing
>> > > > > >> >> Oozie isn't trivial. It would have helped if the JT produced
>> > > > > >> >>notification
>> > > > > >> >> when it comes online, so Oozie could retry after consuming
>> > those.
>> > > > But
>> > > > > >> >> currently, notification only happens when task completes.
>> > > > > >> >>
>> > > > > >> >> An alternate approach is to modify the semantics of
>> > START_MANUAL.
>> > > > > >> >> Currently Oozie puts the action/job in
>> START_MANUAL/SUSPENDED
>> > and
>> > > > > >> >>expects
>> > > > > >> >> the user to resume it. We can change this and make Oozie
>> retry
>> > > the
>> > > > > >> >> START_MANUAL actions at configurable interval (~30 mins or
>> some
>> > > > > >>scheme
>> > > > > >> >> like exp back off). Of course, this is bad as oozie will
>> > keep
>> > > > > >> >>polling
>> > > > > >> >> hadoop at some interval but manual resume of jobs who have
>> > faced
>> > > > > >> >>transient
>> > > > > >> >> errors will no longer be mandatory.
>> > > > > >> >>
>> > > > > >> >> --Virag
>> > > > > >> >>
>> > > > > >> >>
>> > > > > >> >> On 8/6/13 4:38 PM, "Robert Kanter" <rkan...@cloudera.com>
>> > wrote:
>> > > > > >> >>
>> > > > > >> >> >If ActionCheckX is trying to retry, and the JT recovers the job, that
>> > > > > >> >> >should be fine.  The "retry" is to simply try connecting to the JT to get
>> > > > > >> >> >the status for the job.  If the user issues a "RESUME" for a START_MANUAL
>> > > > > >> >> >job, then yes, Oozie will try to resubmit a new job for that action and
>> > > > > >> >> >we'd have two of them if the JT also recovers it.
>> > > > > >> >> >
>> > > > > >> >> >What if we modified the ActionStartXCommand/ResumeActionXCommand
>> > > > > >> >> >precondition to check if the action already has a Job ID that is valid
>> > > > > >> >> >(i.e. not unknown to the JT), then it fails the precondition check or
>> > > > > >> >> >something similar?
>> > > > > >> >> >
>> > > > > >> >> >- Robert
>> > > > > >> >> >
>> > > > > >> >> >
>> > > > > >> >> >On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <
>> > > > vi...@yahoo-inc.com>
>> > > > > >> >> wrote:
>> > > > > >> >> >
>> > > > > >> >> >> ActionCheckX first retries for a configurable amount of
>> time
>> > > and
>> > > > > >>then
>> > > > > >> >> >> makes the status as START_MANUAL.
>> > > > > >> >> >> So, the problem might happen when JT recovers the job
>> during
>> > > the
>> > > > > >>same
>> > > > > >> >> >>time
>> > > > > >> >> >> when 1) ActionCheckX is trying to retry or the 2) user
>> > issues
>> > > a
>> > > > > >> >>"RESUME"
>> > > > > >> >> >> for a start_manual job.
>> > > > > >> >> >> We have to ensure that this doesn't happen otherwise we
>> will
>> > > > have
>> > > > > >>two
>> > > > > >> >> >> hadoop jobs for the same action.
>> > > > > >> >> >> The callback happens only when the task is completed
>> which
>> > > might
>> > > > > >>be
>> > > > > >> >>too
>> > > > > >> >> >> late. During that time, Oozie might have already
>> submitted a
>> > > new
>> > > > > >> >>hadoop
>> > > > > >> >> >> job for that wf action.
>> > > > > >> >> >> So it doesn't seem straightforward to prevent Oozie to
>> > submit
>> > > a
>> > > > > >>new
>> > > > > >> >>job
>> > > > > >> >> >>if
>> > > > > >> >> >> the JT is already recovering the older one.
>> > > > > >> >> >>
>> > > > > >> >> >>
>> > > > > >> >> >>
>> > > > > >> >> >> On 8/6/13 4:01 PM, "Robert Kanter" <rkan...@cloudera.com
>> >
>> > > > wrote:
>> > > > > >> >> >>
>> > > > > >> >> >> >Yes, if JT recovers the job, it uses the same ID.  If
>> the
>> > JT
>> > > > > >>comes
>> > > > > >> >>up
>> > > > > >> >> >> >quickly and recovers the job, Oozie continues working
>> just
>> > > fine
>> > > > > >> >> >>(without
>> > > > > >> >> >> >the ID swap issues discussed earlier).  When the JT
>> takes
>> > > > longer
>> > > > > >> >>than
>> > > > > >> >> >>the
>> > > > > >> >> >> >10min ActionCheck interval, and the action is
>> START_MANUAL,
>> > > > that
>> > > > > >> >>still
>> > > > > >> >> >> >needs to be figured out.
>> > > > > >> >> >> >
>> > > > > >> >> >> >I haven't tested on Hadoop 2.x yet, but I've been told
>> that
>> > > it
>> > > > > >> >>should
>> > > > > >> >> >>have
>> > > > > >> >> >> >the same behavior.  The only differences are that the
>> name
>> > of
>> > > > the
>> > > > > >> >> >>property
>> > > > > >> >> >> >to enable recoverability on the server (not the
>> job-level
>> > > one)
>> > > > is
>> > > > > >> >> >> >different
>> > > > > >> >> >> >obviously because it doesn't have "jobtracker" in it
>> and it
>> > > can
>> > > > > >>also
>> > > > > >> >> >> >recover the completed tasks, which shouldn't be a
>> problem
>> > > > because
>> > > > > >> >>the
>> > > > > >> >> >> >launcher jar has the one task.  I'll of course double
>> check
>> > > > this
>> > > > > >> >> >>though.
>> > > > > >> >> >> >
>> > > > > >> >> >> >
>> > > > > >> >> >> >- Robert
>> > > > > >> >> >> >
>> > > > > >> >> >> >
>> > > > > >> >> >> >On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy
>> > > > > >> >> >> ><rohini.adi...@gmail.com>wrote:
>> > > > > >> >> >> >
>> > > > > >> >> >> >> Robert,
>> > > > > >> >> >> >>     You will not get an unknown hadoop job if JT has
>> retry
>> > > > > >> >>configured
>> > > > > >> >> >> >>right?
>> > > > > >> >> >> >> What happens in that case? Especially what happens
>> when
>> > > Oozie
>> > > > > >> >>retry
>> > > > > >> >> >> >>happens
>> > > > > >> >> >> >> when JT comes up quickly?  Also do you know what is
>> the
>> > > > > >>behaviour
>> > > > > >> >> >>with
>> > > > > >> >> >> >> Hadoop 2.x ?
>> > > > > >> >> >> >>
>> > > > > >> >> >> >> Mayank,
>> > > > > >> >> >> >>   OOZIE-1231 already has the changes to show Mapreduce
>> > job
>> > > id
>> > > > > >>in
>> > > > > >> >>the
>> > > > > >> >> >> >>Child
>> > > > > >> >> >> >> job page to be consistent with other job types. The v1
>> > API
>> > > > has
>> > > > > >>the
>> > > > > >> >> >>older
>> > > > > >> >> >> >> behaviour with map job url in externalId, while v2 API
>> > has
>> > > it
>> > > > > >>in
>> > > > > >> >> >> >> childjobids.  So there is a UI change but v1 REST API
>> has
>> > > not
>> > > > > >> >> >>changed.
>> > > > > >> >> >> >>But
>> > > > > >> >> >> >> OOZIE-1231 has not changed any code with respect to id
>> > > swap.
>> > > > > >> >> >> >>
>> > > > > >> >> >> >> Regards,
>> > > > > >> >> >> >> Rohini
>> > > > > >> >> >> >>
>> > > > > >> >> >> >> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter
>> > > > > >> >><rkan...@cloudera.com>
>> > > > > >> >> >> >> wrote:
>> > > > > >> >> >> >>
>> > > > > >> >> >> >> > Ya, I saw a precondition failed message.
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > I just tried out what happens when the job is
>> > SUSPENDED,
>> > > > the
>> > > > > >> >> >>action is
>> > > > > >> >> >> >> > START_MANUAL, and the JT recovers the hadoop job: It
>> > > > doesn't
>> > > > > >> >> >>continue
>> > > > > >> >> >> >>the
>> > > > > >> >> >> >> > workflow.  It fails the eagerVerifyPrecondition from
>> > > > > >> >> >> >> > CompletedActionXCommand because the action isn't
>> > RUNNING.
>> > > > > >> >>Perhaps
>> > > > > >> >> >>we
>> > > > > >> >> >> >> > should make the CallbackService change the status in
>> > this
>> > > > > >> >> >>situation?
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > Just to clarify, the above only happens when the JT
>> has
>> > > > been
>> > > > > >> >>down
>> > > > > >> >> >>long
>> > > > > >> >> >> >> > enough that the ActionCheckXCommand (every 10min by
>> > > > default)
>> > > > > >>+
>> > > > > >> >>the
>> > > > > >> >> >> >> retries
>> > > > > >> >> >> >> > (3 x 1min) happen.  If it comes back sooner than
>> that,
>> > > > > >> >>everything
>> > > > > >> >> >> >>works
>> > > > > >> >> >> >> > fine.
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > thanks
>> > > > > >> >> >> >> > - Robert
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari
>> > > > > >> >><vi...@yahoo-inc.com
>> > > > > >> >> >
>> > > > > >> >> >> >> wrote:
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > > Oh..okay. Seems like RecoveryService queues the
>> > StartX
>> > > > > >>command
>> > > > > >> >> >>but
>> > > > > >> >> >> >>the
>> > > > > >> >> >> >> > > verifyPrecondition() fails as the wf job is
>> > > > > >> >> >> >> > > Suspended (please verify this from the logs).
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > > In that case, if Oozie is not auto-retrying and
>> > > > > >>resubmitting,
>> > > > > >> >> >>then
>> > > > > >> >> >> >>it
>> > > > > >> >> >> >> > > seems fair to have the JT recover the job.
>> > > > > >> >> >> >> > > But if JT recovers the job, can we make sure that
>> the
>> > > > > >>workflow
>> > > > > >> >> >>job
>> > > > > >> >> >> >> > > transits to RUNNING from SUSPENDED and wf action
>> from
>> > > > > >> >> >>START_MANUAL
>> > > > > >> >> >> >>to
>> > > > > >> >> >> >> > > RUNNING?
>> > > > > >> >> >> >> > > It should not happen that the user resumes the job
>> > > which
>> > > > > >>makes
>> > > > > >> >> >>Oozie
>> > > > > >> >> >> >> > > submit a new hadoop job while the JT is also
>> > recovering
>> > > > the
>> > > > > >> >>same
>> > > > > >> >> >> >>job.
>> > > > > >> >> >> >> > > Also, I think the error can still be considered
>> > > transient
>> > > > > >>from
>> > > > > >> >> >>Oozie
>> > > > > >> >> >> >> > > perspective as it is temporary depending on state
>> of
>> > > JT.
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > > Thanks,
>> > > > > >> >> >> >> > > Virag
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > > On 8/6/13 1:12 PM, "Robert Kanter" <
>> > > rkan...@cloudera.com
>> > > > >
>> > > > > >> >>wrote:
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > > >Virag,
>> > > > > >> >> >> >> > > >I just tested out killing the JT and waiting for
>> the
>> > > > > >>Checker
>> > > > > >> >> >> >>service
>> > > > > >> >> >> >> to
>> > > > > >> >> >> >> > > >retry and give up: the action goes to
>> START_MANUAL
>> > and
>> > > > the
>> > > > > >> >>job
>> > > > > >> >> >>gets
>> > > > > >> >> >> >> > > >SUSPENDED.  I waited around long enough, but the
>> > > > > >> >>RecoveryService
>> > > > > >> >> >> >> didn't
>> > > > > >> >> >> >> > do
>> > > > > >> >> >> >> > > >anything.  Does it kick in for you?  As a side
>> note,
>> > > > > >>looking
>> > > > > >> >>at
>> > > > > >> >> >>the
>> > > > > >> >> >> >> > code,
>> > > > > >> >> >> >> > > >the RecoveryService looks like it can handle
>> > > > START_MANUAL,
>> > > > > >> >> >> >>END_MANUAL,
>> > > > > >> >> >> >> > and
>> > > > > >> >> >> >> > > >USER_RETRY, which all sound like things the user
>> > > should
>> > > > be
>> > > > > >> >> >>doing;
>> > > > > >> >> >> >>is
>> > > > > >> >> >> >> it
>> > > > > >> >> >> >> > > >correct that RecoveryService is handling these?
>> > > > > >> >> >> >> > > >The Unknown Hadoop Job error happens when the JT
>> > comes
>> > > > > >>back
>> > > > > >> >>in
>> > > > > >> >> >>time
>> > > > > >> >> >> >> > > >because
>> > > > > >> >> >> >> > > >it won't know about the old ID if it's not
>> recovering
>> > > > jobs.
>> > > > > >> >>So,
>> > > > > >> >> >> >>Oozie
>> > > > > >> >> >> >> > > >tries
>> > > > > >> >> >> >> > > >to ask it about a job that no longer exists.  I'm
>> > not
>> > > > sure
>> > > > > >> >>that
>> > > > > >> >> >> >>this
>> > > > > >> >> >> >> > > >should
>> > > > > >> >> >> >> > > >be a transient error because there's no way to
>> > > determine
>> > > > > >>if
>> > > > > >> >>it's
>> > > > > >> >> >> >> because
>> > > > > >> >> >> >> > > >the
>> > > > > >> >> >> >> > > >JT restarted and Oozie should resubmit the job
>> or if
>> > > > > >> >>something
>> > > > > >> >> >>else
>> > > > > >> >> >> >> > > >happened.
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >Mayank,
>> > > > > >> >> >> >> > > >That is a good point.  We could either make a v3
>> API
>> > > or
>> > > > > >>add
>> > > > > >> >>an
>> > > > > >> >> >> >> > oozie-site
>> > > > > >> >> >> >> > > >config to turn on/off the id swap behavior and
>> keep
>> > > the
>> > > > v2
>> > > > > >> >>API.
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >thanks
>> > > > > >> >> >> >> > > >- Robert
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal
>> > > > > >> >> >><may...@apache.org>
>> > > > > >> >> >> >> > wrote:
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >> Robert,
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> That's a break in backward compatibility. Until now, users have been used to
>> > > > > >> >> >> >> > > >> clicking on the link to go to the MR page.
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> Is there a better way to handle this?
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> Thanks,
>> > > > > >> >> >> >> > > >> Mayank
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter
>> <
>> > > > > >> >> >> >> rkan...@cloudera.com>
>> > > > > >> >> >> >> > > >> wrote:
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> > Mona,
>> > > > > >> >> >> >> > > >> > As far as I'm aware, the "retry" that Oozie
>> is
>> > > doing
>> > > > > >>is
>> > > > > >> >>just
>> > > > > >> >> >> >> > retrying
>> > > > > >> >> >> >> > > >>to
>> > > > > >> >> >> >> > > >> > connect to the JT (which is why when the JT
>> > comes
>> > > > back
>> > > > > >> >>up,
>> > > > > >> >> >> >>Oozie
>> > > > > >> >> >> >> > > >> > can continue monitoring the hadoop job if it
>> > still
>> > > > has
>> > > > > >> >>the
>> > > > > >> >> >>same
>> > > > > >> >> >> >> ID);
>> > > > > >> >> >> >> > > >>it
>> > > > > >> >> >> >> > > >> > doesn't try to submit the job again as part
>> of
>> > the
>> > > > > >> >>"retry".
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> > Mayank,
>> > > > > >> >> >> >> > > >> > We can put the ID for the actual job in the
>> > Child
>> > > > IDs
>> > > > > >>tab
>> > > > > >> >> >>(like
>> > > > > >> >> >> >> with
>> > > > > >> >> >> >> > > >> Pig).
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> > - Robert
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank
>> Bansal
>> > > > > >> >> >> >><may...@apache.org
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > > >> wrote:
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> > > I agree, we should handle these two scenarios. I am OK with changing the
>> > > > > >> >> >> >> > > >> > > launcher behavior for MR; however, if we remove the ID swap, then how do we
>> > > > > >> >> >> >> > > >> > > navigate to MR jobs from the UI as we do right now?
>> > > > > >> >> >> >> > > >> > >
>> > > > > >> >> >> >> > > >> > > Thanks,
>> > > > > >> >> >> >> > > >> > > Mayank
>> > > > > >> >> >> >> > > >> > >
>> > > > > >> >> >> >> > > >> > >
>> > > > > >> >> >> >> > > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert
>> Kanter
>> > > > > >> >> >> >> > > >><rkan...@cloudera.com>
>> > > > > >> >> >> >> > > >> > > wrote:
>> > > > > >> >> >> >> > > >> > >
>> > > > > >> >> >> >> > > >> > > > Suppose we leave the MR ID swap thing as is but set the launcher recover to
>> > > > > >> >> >> >> > > >> > > > 0 and job to 1; then consider these two scenarios:
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > 1. JT gets restarted during the launcher job but before the launcher job
>> > > > > >> >> >> >> > > >> > > > actually launches the real job:
>> > > > > >> >> >> >> > > >> > > >      - The launcher job won't be recovered because we told it not to
>> > > > > >> >> >> >> > > >> > > >      - The real job was never launched
>> > > > > >> >> >> >> > > >> > > >      ---> Action never completes and Oozie marks it as failed
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > 2. Launcher job submits the real job, but JT gets restarted before the
>> > > > > >> >> >> >> > > >> > > > Oozie server has a chance to swap IDs (it's not an atomic operation):
>> > > > > >> >> >> >> > > >> > > >      - The launcher job won't be recovered because we told it not to
>> > > > > >> >> >> >> > > >> > > >      - The real job will be recovered and finish successfully
>> > > > > >> >> >> >> > > >> > > >      ---> Oozie marks the action as failed even though the actual job
>> > > > > >> >> >> >> > > >> > > >      succeeded because it didn't know about the ID swap
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > It would only work for the case where the JT gets restarted after the ID
>> > > > > >> >> >> >> > > >> > > > swap occurs.
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > - Robert
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank
>> > > Bansal <
>> > > > > >> >> >> >> > may...@apache.org
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >> > > wrote:
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > > Hi Robert,
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > +1 for Oozie to set recovery to 1 for the launcher and 0 for the jobs in
>> > > > > >> >> >> >> > > >> > > > > all cases except MR.
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > After the ID swap, Oozie only knows about the MR job, doesn't it? Then
>> > > > > >> >> >> >> > > >> > > > > there should not be any problem.
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > If we set the MR launcher's recover to 0 and the job's to 1, then the job will
>> > > > > >> >> >> >> > > >> > > > > succeed in case of a JT restart.
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > Am I missing something?
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > Thanks,
>> > > > > >> >> >> >> > > >> > > > > Mayank
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert
>> > > Kanter
>> > > > <
>> > > > > >> >> >> >> > > >> rkan...@cloudera.com>
>> > > > > >> >> >> >> > > >> > > > > wrote:
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > > I think you usually just get the
>> > "Unknown
>> > > > > >>Hadoop
>> > > > > >> >> >>Job"
>> > > > > >> >> >> >> error
>> > > > > >> >> >> >> > > >> message
>> > > > > >> >> >> >> > > >> > > > > because
>> > > > > >> >> >> >> > > >> > > > > > Oozie tries to look up the Hadoop
>> Job ID
>> > > it
>> > > > > >> >>already
>> > > > > >> >> >> >>has,
>> > > > > >> >> >> >> but
>> > > > > >> >> >> >> > > >>the
>> > > > > >> >> >> >> > > >> JT
>> > > > > >> >> >> >> > > >> > > no
>> > > > > >> >> >> >> > > >> > > > > > longer has that ID because it was
>> > > restarted.
>> > > > > >> >>With
>> > > > > >> >> >>JT
>> > > > > >> >> >> >> > > >> > Recoverability
>> > > > > >> >> >> >> > > >> > > > > turned
>> > > > > >> >> >> >> > > >> > > > > > on, it will restart the job using the
>> > same
>> > > > > >>ID, so
>> > > > > >> >> >>Oozie
>> > > > > >> >> >> >> > > >>continues
>> > > > > >> >> >> >> > > >> > > just
>> > > > > >> >> >> >> > > >> > > > > > fine.
>> > > > > >> >> >> >> > > >> > > > > >
>> > > > > >> >> >> >> > > >> > > > > > - Robert
>> > > > > >> >> >> >> > > >> > > > > >
>> > > > > >> >> >> >> > > >> > > > > >
>> > > > > >> >> >> >> > > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM,
>> Rohini
>> > > > > >> >>Palaniswamy
>> > > > > >> >> >> >> > > >> > > > > > <rohini.adi...@gmail.com>wrote:
>> > > > > >> >> >> >> > > >> > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > Wouldn't oozie poll for the job status and decide that
>> > > > > >> >> >> >> > > >> > > > > > > it has failed and when JT comes up launch another one
>> > > > > >> >> >> >> > > >> > > > > > > if retry is configured?
>> > > > > >> >> >> >> > > >> > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter
>> > > > > >> >> >> >> > > >> > > > > > > <rkan...@cloudera.com> wrote:
>> > > > > >> >> >> >> > > >> > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > > Hi,
>> > > > > >> >> >> >> > > >> > > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > > We looked into how to support Job Recoverability (i.e.
>> > > > > >> >> >> >> > > >> > > > > > > > the JT is restarted and it wants to restart the jobs
>> > > > > >> >> >> >> > > >> > > > > > > > that were running; similarly for YARN) and have a pretty
>> > > > > >> >> >> >> > > >> > > > > > > > simple solution for all of the action types except for
>> > > > > >> >> >> >> > > >> > > > > > > > MapReduce.  If we set mapreduce.job.restart.recover=true
>> > > > > >> >> >> >> > > >> > > > > > > > for the launcher job and
>> > > > > >> >> >> >> > > >> > > > > > > > mapreduce.job.restart.recover=false for the jobs launched
>> > > > > >> >> >> >> > > >> > > > > > > > by the launcher, then when the JT restarts, it will
>> > > > > >> >> >> >> > > >> > > > > > > > recover the launcher job but not the child jobs -- the
>> > > > > >> >> >> >> > > >> > > > > > > > launcher job will then take care of relaunching the
>> > > > > >> >> >> >> > > >> > > > > > > > child jobs.
>> > > > > >> >> >> >> > > >> > > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > > For MapReduce, because of the optimization with the id
>> > > > > >> >> >> >> > > >> > > > > > > > swap, this won't work.  It would be very tricky, if it's
>> > > > > >> >> >> >> > > >> > > > > > > > even practical, to do something similar for the MR
>> > > > > >> >> >> >> > > >> > > > > > > > action.  Instead, we think it would be best if we simply
>> > > > > >> >> >> >> > > >> > > > > > > > remove the MR optimization and make it just like the
>> > > > > >> >> >> >> > > >> > > > > > > > other action types.  I know we normally don't want to
>> > > > > >> >> >> >> > > >> > > > > > > > remove optimizations, but there are many advantages in
>> > > > > >> >> >> >> > > >> > > > > > > > this case, and it's only saving a single Map slot for MR
>> > > > > >> >> >> >> > > >> > > > > > > > jobs only.
>> > > > > >> >> >> >> > > >> > > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > > I've created OOZIE-1483
>> > > > > >> >> >> >> > > >> > > > > > > > <https://issues.apache.org/jira/browse/OOZIE-1483> with
>> > > > > >> >> >> >> > > >> > > > > > > > more details and should have a patch soon.
>> > > > > >> >> >> >> > > >> > > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > > Thoughts?
>> > > > > >> >> >> >> > > >> > > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > > > thanks
>> > > > > >> >> >> >> > > >> > > > > > > > - Robert
>> > > > > >> >> >> >> > > >> > > > > > > >
>> > > > > >> >> >> >> > > >> > > > > > >
>> > > > > >> >> >> >> > > >> > > > > >
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > >
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >>
>> > > > > >> >> >>
>> > > > > >> >> >>
>> > > > > >> >>
>> > > > > >> >>
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >--
>> > > > > >> >Alejandro
>> > > > > >>
>> > > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Alejandro
>> >
>>
>
>
