Tucu,

Any idea what the status of job recoverability with YARN is? Is it part of the 2.1 release? At least I know that it isn't supported on our clusters yet. If you don't know, I can check with our Hadoop team.

Regards,
Rohini

On Thu, Aug 8, 2013 at 1:30 PM, Alejandro Abdelnur <t...@cloudera.com> wrote:

> the change mentioned in 1) is a bug, a nasty one. This is a problem with JT recovery turned ON or OFF and with any version of Hadoop.
>
> It has to be fixed.
>
> Also, Hadoop 1 JT job recovery is stable and works as expected.
>
> Thanks.
>
> On Thu, Aug 8, 2013 at 10:56 AM, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote:
>
>> Haven't gone through the whole thread in detail yet. But looking at the change mentioned in 1), the first thing that comes to my mind is that it might not work as expected if job recoverability is not turned on. We need to consider that case. We cannot expect everyone to be in the latest version of hadoop and have recoverability turned on. Job recoverability in hadoop is not fully mature yet and not tested well.
>>
>> On Thu, Aug 8, 2013 at 10:17 AM, Robert Kanter <rkan...@cloudera.com> wrote:
>>
>>> So, does this sound good?
>>>
>>> 1) Create a JIRA to make the ActionCheckXCommand leave the action RUNNING instead of START_MANUAL and ResumeXCommand shouldn't resubmit the job
>>> 2) OOZIE-1483 to remove the MR optimization and set the launcher job to recover but not the real job
>>>
>>> The property to set a job to not recover wasn't added until Hadoop 1.2.0 and we're using 1.1.1, so we'll also need:
>>> 3) Create a JIRA to bump up the Hadoop version to 1.2.x
>>>
>>> There's also a problem with the DistCp action where DistCp doesn't actually read the jobconf that Oozie prepares, and recoverability is enabled by default on all jobs, so we can't disable it for the DistCp action until DistCp is updated accordingly and we switch to a Hadoop release with that fix, so we'll also need:
>>> 4) A MAPREDUCE JIRA to make DistCp accept a jobconf
>>> In the meantime, this will have to be a known issue where if the JT is restarted with recoverability, you'll end up with two hadoop jobs running DistCp
>>>
>>> And what should we do about the external id being the launcher job instead of the real job after removing the MR optimization?
>>>
>>> thanks
>>> - Robert
>>>
>>> On Wed, Aug 7, 2013 at 8:45 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>>>
>>>> Ahh..I forgot about OOZIE-994. My bad, I suggested that change. Everything makes sense now. Thanks!
>>>>
>>>> On 8/7/13 7:38 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>>>>
>>>>> The behavior where the ActionCheckXCommand calls handleNonTransient() with START_MANUAL when the JT can't be reached after the retries and on RESUME command will resubmit the job was something I did for OOZIE-994. In hindsight, we shouldn't have done it that way.
>>>>>
>>>>> Yes, it will fail if job recovery is not enabled in the JT/RM; but I think this is the more correct behavior as this is something that the external system should be taking care of.
>>>>>
>>>>> - Robert
>>>>>
>>>>> On Wed, Aug 7, 2013 at 5:05 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>>>>>
>>>>>> Alejandro, I agree that functionality would be preserved if the action is left in RUNNING during a transient error.
>>>>>>
>>>>>> A few questions:
>>>>>>
>>>>>> 1) START_MANUAL seems to be set only by handleNonTransient(). If this is a bug, do you know for what purpose it was introduced? I thought having START_MANUAL was a way to distinguish between Oozie suspending a job due to a transient error and a user manually suspending the job.
>>>>>>
>>>>>> 2) With no oozie retry on 'RESUME', jobs will fail if JT/RM recovery is not enabled. And it seems that YARN recovery is still not there as YARN-128 is not yet committed (not sure if I'm looking at the right JIRA). It's a concern for us as we ask users to RESUME their jobs after a hadoop upgrade. Now they have to resume the wf and rerun the failed actions.
>>>>>>
>>>>>> Thanks,
>>>>>> Virag
>>>>>>
>>>>>> On 8/7/13 2:48 PM, "Alejandro Abdelnur" <t...@cloudera.com> wrote:
>>>>>>
>>>>>>> [joining the party a bit late]
>>>>>>>
>>>>>>> I just had an offline call with RobertK who brought me up to speed.
>>>>>>>
>>>>>>> By design, Oozie will retry starting a workflow action ONLY if it couldn't start the WF action before.
>>>>>>> If Oozie started the WF action successfully, the WF action state goes into RUNNING, and from then on it is the responsibility of the external system running the action to recover it. Oozie will not attempt any recovery after that point.
>>>>>>>
>>>>>>> This means that with Hadoop (JT or YARN) job recovery, the launcher job will be recovered by Hadoop without any intervention from Oozie.
>>>>>>>
>>>>>>> It is clear that to have recovery for the MR action we need to get rid of the swap and just hold onto the MR launcher job as we do for the other actions.
>>>>>>>
>>>>>>> Now, on the whole discussion on the ActionCheckXCommand retries. We have a bug in the ActionCheckXCommand: on handleNonTransient() we should not change the status of the WF action to START_MANUAL, we should leave it in RUNNING. handleNonTransient() will suspend the WF job, thus switching off action checks. On WF job resume, the action checks will start working again, and if Hadoop has job recovery, things will work fine. Else the WF action will fail because the launcher job is not known (the external system does not know how to recover jobs). Because we are resetting the status to START_MANUAL we are dialing back on the lifecycle of the action; that is incorrect and that creates the race condition that introduces 2 jobs.
>>>>>>>
>>>>>>> So again, Oozie is not responsible for recovering actions. With that assumption, fixing the handleNonTransient() to leave the status in RUNNING and getting rid of the RM swap logic we should be good.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> On Wed, Aug 7, 2013 at 12:27 AM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>>>>>>>
>>>>>>>> Robert,
>>>>>>>>
>>>>>>>> I have been thinking on this for a while and have a few more concerns if the job retries are not streamlined through Oozie.
>>>>>>>>
>>>>>>>> 1) Till the JT finishes recovering the job, the wf job/wf action status will be SUSPENDED/START_MANUAL. Isn't it misleading as the hadoop job is RUNNING while oozie incorrectly shows it as SUSPENDED? Even if we allow this, after the job completes, what if the callback is lost or oozie is down? To prevent the job being in SUSPENDED forever, we need to hack our services to pull SUSPENDED/START_MANUAL jobs from the db and update their status.
>>>>>>>>
>>>>>>>> 2) Should we allow failing of the user RESUME command if the action is in START_MANUAL to prevent the race condition we were discussing? This would mean changing the semantics of the states.
>>>>>>>>
>>>>>>>> 3) Confused on mapred.job.restart.recover. Reading http://archive.cloudera.com/cdh4/cdh/4/mr1/mapred-default.html, it says that the default value of this is true. So, if mapred.jobtracker.restart.recover (system config) is already enabled, is job recovery on by default? Also, does recover mean the job will start where it left off or is it just a plain restart?
>>>>>>>>
>>>>>>>> In summary, IMO allowing hadoop to recover jobs independently, bypassing Oozie, isn't trivial.
>>>>>>>> It would have helped if the JT produced a notification when it comes online, so Oozie could retry after consuming those. But currently, notification only happens when a task completes.
>>>>>>>>
>>>>>>>> An alternate approach is to modify the semantics of START_MANUAL. Currently Oozie puts the action/job in START_MANUAL/SUSPENDED and expects the user to resume it. We can change this and make Oozie retry the START_MANUAL actions at a configurable interval (~30 mins or some scheme like exp back off). Of course, this is bad as oozie will keep polling hadoop at some interval, but manual resume of jobs that have faced transient errors will no longer be mandatory.
>>>>>>>>
>>>>>>>> --Virag
>>>>>>>>
>>>>>>>> On 8/6/13 4:38 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> If ActionCheckX is trying to retry, and the JT recovers the job, that should be fine. The "retry" is to simply try connecting to the JT to get the status for the job. If the user issues a "RESUME" for a START_MANUAL job, then yes, Oozie will try to resubmit a new job for that action and we'd have two of them if the JT also recovers it.
>>>>>>>>>
>>>>>>>>> What if we modified the ActionStartXCommand/ResumeActionXCommand precondition to check if the action already has a Job ID that is valid (i.e. not unknown to the JT), then it fails the precondition check or something similar?
>>>>>>>>>
>>>>>>>>> - Robert
>>>>>>>>>
>>>>>>>>> On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>>>>>>>>>
>>>>>>>>>> ActionCheckX first retries for a configurable amount of time and then sets the status to START_MANUAL. So, the problem might happen when the JT recovers the job at the same time as 1) ActionCheckX is trying to retry or 2) the user issues a "RESUME" for a START_MANUAL job. We have to ensure that this doesn't happen, otherwise we will have two hadoop jobs for the same action. The callback happens only when the task is completed, which might be too late. During that time, Oozie might have already submitted a new hadoop job for that wf action. So it doesn't seem straightforward to prevent Oozie from submitting a new job if the JT is already recovering the older one.
>>>>>>>>>>
>>>>>>>>>> On 8/6/13 4:01 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, if JT recovers the job, it uses the same ID. If the JT comes up quickly and recovers the job, Oozie continues working just fine (without the ID swap issues discussed earlier). When the JT takes longer than the 10min ActionCheck interval, and the action is START_MANUAL, that still needs to be figured out.
>>>>>>>>>>>
>>>>>>>>>>> I haven't tested on Hadoop 2.x yet, but I've been told that it should have the same behavior. The only differences are that the name of the property to enable recoverability on the server (not the job-level one) is different obviously because it doesn't have "jobtracker" in it and it can also recover the completed tasks, which shouldn't be a problem because the launcher jar has the one task. I'll of course double check this though.
>>>>>>>>>>>
>>>>>>>>>>> - Robert
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Robert,
>>>>>>>>>>>> You will not get a unknown hadoop job if JT has retry configured right? What happens in that case? Especially what happens when Oozie retry happens when JT comes up quickly? Also do you know what is the behaviour with Hadoop 2.x ?
>>>>>>>>>>>>
>>>>>>>>>>>> Mayank,
>>>>>>>>>>>> OOZIE-1231 already has the changes to show Mapreduce job id in the Child job page to be consistent with other job types. The v1 API has the older behaviour with map job url in externalId, while v2 API has it in childjobids. So there is a UI change but v1 REST API has not changed.
>>>>>>>>>>>> But OOZIE-1231 has not changed any code with respect to id swap.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Rohini
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter <rkan...@cloudera.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ya, I saw a precondition failed message.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I just tried out what happens when the job is SUSPENDED, the action is START_MANUAL, and the JT recovers the hadoop job: It doesn't continue the workflow. It fails the eagerVerifyPrecondition from CompletedActionXCommand because the action isn't RUNNING. Perhaps we should make the CallbackService change the status in this situation?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to clarify, the above only happens when the JT has been down long enough that the ActionCheckXCommand (every 10min by default) + the retries (3 x 1min) happen. If it comes back sooner than that, everything works fine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> thanks
>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Oh..okay.
>>>>>>>>>>>>>> Seems like RecoveryService queues the StartX command but the verifyPrecondition() fails as the wf job is SUSPENDED (please verify this from the logs).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In that case, if Oozie is not auto-retrying and resubmitting, then it seems fair to have the JT recover the job. But if the JT recovers the job, can we make sure that the workflow job transitions to RUNNING from SUSPENDED and the wf action from START_MANUAL to RUNNING? It should not happen that the user resumes the job, which makes Oozie submit a new hadoop job, while the JT is also recovering the same job. Also, I think the error can still be considered transient from Oozie's perspective as it is temporary depending on the state of the JT.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Virag
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 8/6/13 1:12 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Virag,
>>>>>>>>>>>>>>> I just tested out killing the JT and waiting for the Checker service to retry and give up: the action goes to START_MANUAL and the job gets SUSPENDED. I waited around long enough, but the RecoveryService didn't do anything.
>>>>>>>>>>>>>>> Does it kick in for you? As a side note, looking at the code, the RecoveryService looks like it can handle START_MANUAL, END_MANUAL, and USER_RETRY, which all sound like things the user should be doing; is it correct that RecoveryService is handling these?
>>>>>>>>>>>>>>> The Unknown Hadoop Job error happens when the JT comes back in time because it won't know about the old ID if it's not recovering jobs. So, Oozie tries to ask it about a job that no longer exists. I'm not sure that this should be a transient error because there's no way to determine if it's because the JT restarted and Oozie should resubmit the job or if something else happened.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mayank,
>>>>>>>>>>>>>>> That is a good point. We could either make a v3 API or add an oozie-site config to turn on/off the id swap behavior and keep the v2 API.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal <may...@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Robert,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That's a break in backward compatibility. Until now, users have been used to clicking on the link to go to the MR page.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there a better way to handle this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Mayank
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <rkan...@cloudera.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mona,
>>>>>>>>>>>>>>>>> As far as I'm aware, the "retry" that Oozie is doing is just retrying to connect to the JT (which is why when the JT comes back up, Oozie can continue monitoring the hadoop job if it still has the same ID); it doesn't try to submit the job again as part of the "retry".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mayank,
>>>>>>>>>>>>>>>>> We can put the ID for the actual job in the Child IDs tab (like with Pig).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <may...@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree, we should handle these two scenarios. I am ok with changing the launcher behavior for MR; however, if we remove the id swap, then how do we navigate to MR jobs from the UI as we do right now?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Mayank
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter <rkan...@cloudera.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Suppose we leave the MR ID swap thing as is but set the launcher recover to 0 and job to 1; then consider these two scenarios:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. JT gets restarted during the launcher job but before the launcher job actually launches the real job:
>>>>>>>>>>>>>>>>>>>    - The launcher job won't be recovered because we told it not to
>>>>>>>>>>>>>>>>>>>    - The real job was never launched
>>>>>>>>>>>>>>>>>>>    ---> Action never completes and Oozie marks it as failed
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. Launcher job submits the real job, but JT gets restarted before the Oozie server has a chance to swap IDs (it's not an atomic operation):
>>>>>>>>>>>>>>>>>>>    - The launcher job won't be recovered because we told it not to
>>>>>>>>>>>>>>>>>>>    - The real job will be recovered and finish successfully
>>>>>>>>>>>>>>>>>>>    ---> Oozie marks the action as failed even though the actual job succeeded because it didn't know about the ID swap
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It would only work for the case where the JT gets restarted after the ID swap occurs.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <may...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Robert,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +1 for oozie to set launcher to 1 and 0 to jobs for recovery in all the cases except MR.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> After the ID swap, Oozie only knows about the MR job, doesn't it? Then there should not be any problem.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If we set MR launcher recover to 0 and job to 1, then the job will succeed in case of a JT restart.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>> Mayank
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <rkan...@cloudera.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think you usually just get the "Unknown Hadoop Job" error message because Oozie tries to look up the Hadoop Job ID it already has, but the JT no longer has that ID because it was restarted. With JT Recoverability turned on, it will restart the job using the same ID, so Oozie continues just fine.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Robert
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Wouldn't oozie poll for the job status and decide that it has failed and when JT comes up launch another one if retry is configured?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <rkan...@cloudera.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> We looked into how to support Job Recoverability (i.e. the JT is restarted and it wants to restart the jobs that were running; similarly for YARN) and have a pretty simple solution for all of the action types except for MapReduce.
>>>>>>>>>>>>>>>>>>>>>>> If we set mapreduce.job.restart.recover=true for the launcher job and mapreduce.job.restart.recover=false for the jobs launched by the launcher, then when the JT restarts, it will recover the launcher job but not the child jobs -- the launcher job will then take care of relaunching the child jobs.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For MapReduce, because of the optimization with the id swap, this won't work. It would be very tricky, if it's even practical, to do something similar for the MR action.
>>>>>>>>>>>>>>>>>>>>>>> Instead, we think it would be best if we simply remove the MR optimization and make it just like the other action types. I know we normally don't want to remove optimizations, but there are many advantages in this case, and it's only saving a single Map slot for MR jobs only.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I've created OOZIE-1483 <https://issues.apache.org/jira/browse/OOZIE-1483> with more details and should have a patch soon.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>>>>>>>> - Robert
>>>>>>>
>>>>>>> --
>>>>>>> Alejandro
>
> --
> Alejandro
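
The race discussed in this thread (handleNonTransient() moving the action to START_MANUAL, followed by a user RESUME while the JT recovers the original job) can be pictured with a small model. This is only a sketch under stated assumptions: the class ActionRecoverySketch, its nested types, and all method names are hypothetical and are not Oozie's actual classes or APIs; only the status names RUNNING, START_MANUAL and SUSPENDED come from the discussion. It counts how many Hadoop jobs exist for one action after a JT restart and a RESUME.

```java
// Hypothetical, simplified model of the behaviour discussed in the thread.
// None of these types are Oozie's real classes; only the status names
// (RUNNING, START_MANUAL, SUSPENDED) are taken from the discussion.
class ActionRecoverySketch {

    enum ActionStatus { RUNNING, START_MANUAL }
    enum JobStatus { RUNNING, SUSPENDED }

    static final class WfJob {
        JobStatus status = JobStatus.RUNNING;
    }

    static final class WfAction {
        ActionStatus status = ActionStatus.RUNNING;
        int resubmissions = 0; // Hadoop jobs submitted by Oozie after the first one
    }

    // Behaviour described as a bug: the action is dialed back to START_MANUAL.
    static void handleNonTransientCurrent(WfJob job, WfAction action) {
        action.status = ActionStatus.START_MANUAL;
        job.status = JobStatus.SUSPENDED;
    }

    // Proposed behaviour: suspend the workflow job, leave the action RUNNING.
    static void handleNonTransientProposed(WfJob job, WfAction action) {
        job.status = JobStatus.SUSPENDED;
    }

    // User RESUME: a START_MANUAL action is started again (a new Hadoop job),
    // while a RUNNING action is left alone and only checked again.
    static void resume(WfJob job, WfAction action) {
        job.status = JobStatus.RUNNING;
        if (action.status == ActionStatus.START_MANUAL) {
            action.resubmissions++;
            action.status = ActionStatus.RUNNING;
        }
    }

    // Number of Hadoop jobs that exist for the action after a JT restart
    // followed by a RESUME: the recovered original (if any) plus resubmissions.
    static int hadoopJobsAfterResume(boolean proposedFix, boolean jtRecoversJob) {
        WfJob job = new WfJob();
        WfAction action = new WfAction();
        if (proposedFix) {
            handleNonTransientProposed(job, action);
        } else {
            handleNonTransientCurrent(job, action);
        }
        resume(job, action);
        int recovered = jtRecoversJob ? 1 : 0;
        return recovered + action.resubmissions;
    }
}
```

In this model, hadoopJobsAfterResume(false, true) corresponds to the duplicate-job case raised by Virag, and hadoopJobsAfterResume(true, false) corresponds to the concern that, with the fix and without JT/RM recovery, no job is left and the action would fail on the next check.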