There's also a bug that needs to be fixed: YARN-1058<https://issues.apache.org/jira/browse/YARN-1058>.
On Mon, Aug 12, 2013 at 4:05 PM, Robert Kanter <rkan...@cloudera.com> wrote:

> Currently, job recoverability (at least for what Oozie needs) isn't quite there yet in YARN. We're working on improving it in YARN-1055<https://issues.apache.org/jira/browse/YARN-1055>.
>
> - Robert
>
> On Mon, Aug 12, 2013 at 2:06 PM, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote:
>
>> Tucu,
>> Any idea on what is the status of job recoverability with YARN? Is it part of 2.1 release? At least I know that we don't have it supported in our clusters yet. I can check with our hadoop team if not.
>>
>> Regards,
>> Rohini
>>
>> On Thu, Aug 8, 2013 at 1:30 PM, Alejandro Abdelnur <t...@cloudera.com> wrote:
>>
>> > the change mentioned in 1) is a bug, a nasty one. This is a problem with JT recovery turned ON or OFF and with any version of Hadoop.
>> >
>> > It has to be fixed.
>> >
>> > Also, Hadoop 1 JT job recovery is stable and works as expected.
>> >
>> > Thanks.
>> >
>> > On Thu, Aug 8, 2013 at 10:56 AM, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote:
>> >
>> > > Haven't gone through the whole thread in detail yet. But looking at the change mentioned in 1), the first thing that comes to my mind is that it might not work as expected if job recoverability is not turned on. We need to consider that case. We cannot expect everyone to be in the latest version of hadoop and have recoverability turned on. Job recoverability in hadoop is not fully mature yet and not tested well.
>> > >
>> > > On Thu, Aug 8, 2013 at 10:17 AM, Robert Kanter <rkan...@cloudera.com> wrote:
>> > >
>> > > > So, does this sound good?
>> > > >
>> > > > 1) Create a JIRA to make the ActionCheckXCommand leave the action RUNNING instead of START_MANUAL and ResumeXCommand shouldn't resubmit the job
>> > > > 2) OOZIE-1483 to remove the MR optimization and set the launcher job to recover but not the real job
>> > > >
>> > > > The property to set a job to not recover wasn't added until Hadoop 1.2.0 and we're using 1.1.1, so we'll also need:
>> > > > 3) Create a JIRA to bump up the Hadoop version to 1.2.x
>> > > >
>> > > > There's also a problem with the DistCp action where DistCp doesn't actually read the jobconf that Oozie prepares, and recoverability is enabled by default on all jobs, so we can't disable it for the DistCp action until DistCp is updated accordingly and we switch to a Hadoop release with that fix, so we'll also need:
>> > > > 4) A MAPREDUCE JIRA to make DistCp accept a jobconf
>> > > > In the meantime, this will have to be a known issue where if the JT is restarted with recoverability, you'll end up with two hadoop jobs running DistCp
>> > > >
>> > > > And what should we do about the external id being the launcher job instead of the real job after removing the MR optimization?
>> > > >
>> > > > thanks
>> > > > - Robert
>> > > >
>> > > > On Wed, Aug 7, 2013 at 8:45 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>> > > >
>> > > > > Ahh..I forgot about Oozie-994. My bad, I suggested that change. Everything makes sense now. Thanks!
>> > > > >
>> > > > > On 8/7/13 7:38 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>> > > > >
>> > > > > >The behavior where the ActionCheckXCommand calls handleNonTransient() with START_MANUAL when the JT can't be reached after the retries, and where a RESUME command then resubmits the job, was something I did for OOZIE-994. In hindsight, we shouldn't have done it that way.
>> > > > > >
>> > > > > >Yes, it will fail if job recovery is not enabled in the JT/RM; but I think this is the more correct behavior as this is something that the external system should be taking care of.
>> > > > > >
>> > > > > >- Robert
>> > > > > >
>> > > > > >On Wed, Aug 7, 2013 at 5:05 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>> > > > > >
>> > > > > >> Alejandro, I agree that functionality would be preserved if action is left in RUNNING during a transient error.
>> > > > > >>
>> > > > > >> Few questions
>> > > > > >>
>> > > > > >> 1) START_MANUAL seems to be set only by handleNonTransient(). If this is a bug, do you know for what purpose it was introduced? I thought having START_MANUAL is a way to distinguish between Oozie suspending job due to transient error and a user manually suspending the job.
>> > > > > >>
>> > > > > >> 2) With no oozie retry on 'RESUME', jobs will fail if JT/RM recovery is not enabled. And it seems that YARN recovery is still not there as YARN-128 is not yet committed (Not sure if looking at right JIRA). It's a concern for us as we ask users to RESUME their jobs after hadoop upgrade. Now they have to resume wf and rerun the failed actions.
>> > > > > >>
>> > > > > >> Thanks,
>> > > > > >> Virag
>> > > > > >>
>> > > > > >> On 8/7/13 2:48 PM, "Alejandro Abdelnur" <t...@cloudera.com> wrote:
>> > > > > >>
>> > > > > >> >[joining the party a bit late]
>> > > > > >> >
>> > > > > >> >I just had an offline call with RobertK who brought me up to speed.
>> > > > > >> >
>> > > > > >> >By design, Oozie will retry starting a workflow action ONLY if it couldn't start the WF action before. If Oozie started the WF action successfully, the WF action state goes into RUNNING, and from then on it is the responsibility of the external system running the action to recover it. Oozie will not attempt any recovery after that point.
>> > > > > >> >
>> > > > > >> >This means that with Hadoop (JT or YARN) job recovery, the launcher job will be recovered by Hadoop without any intervention from Oozie.
>> > > > > >> >
>> > > > > >> >It is clear that to have recovery for MR action we need to get rid of the swap and just hold onto the MR launcher job as we do for the other actions.
>> > > > > >> >
>> > > > > >> >Now, on the whole discussion on the ActionCheckXCommand retries. We have a bug in the ActionCheckXCommand: on handleNonTransient() we should not change the status of the WF action to START_MANUAL, we should leave it in RUNNING. handleNonTransient() will suspend the WF job thus switching off action checks. On WF job resume, the action checks will start working again, and if Hadoop has job recovery, things will work fine.
>> > > > > >> >Else the WF action will fail because the launcher job is not known (the external system does not know how to recover jobs). Because we are resetting the status to START_MANUAL we are dialing back on the lifecycle of the action; that is incorrect and that creates the race condition that introduces 2 jobs.
>> > > > > >> >
>> > > > > >> >So again, Oozie is not responsible for recovering actions. With that assumption, fixing the handleNonTransient() to leave the status in RUNNING and getting rid of the RM swap logic we should be good.
>> > > > > >> >
>> > > > > >> >Thoughts?
>> > > > > >> >
>> > > > > >> >On Wed, Aug 7, 2013 at 12:27 AM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>> > > > > >> >
>> > > > > >> >> Robert,
>> > > > > >> >>
>> > > > > >> >> I have been thinking on this for a while and have few more concerns if the job retries are not streamlined through Oozie.
>> > > > > >> >>
>> > > > > >> >> 1) Till the JT finishes recovering the job, the wf job/wf action status will be SUSPENDED/START_MANUAL. Isn't it misleading as the hadoop job is RUNNING while oozie incorrectly shows as SUSPENDED? Even if we allow this, after the job completes, what if the callback is lost or oozie is down? To prevent the job being in SUSPENDED forever, we need to hack our services to pull SUSPENDED/START_MANUAL jobs from db and update their status.
>> > > > > >> >>
>> > > > > >> >> 2) Should we allow failing of the user RESUME command if the action is in START_MANUAL to prevent the race condition we were discussing? This would mean changing the semantics of the states.
>> > > > > >> >>
>> > > > > >> >> 3) Confused on mapred.job.restart.recover. Reading http://archive.cloudera.com/cdh4/cdh/4/mr1/mapred-default.html, it says that the default value of this is true. So, if mapred.jobtracker.restart.recover (system config) is already enabled, is job recovery on by default? Also, does recover mean the job will start where it left from or is it just plain restart?
>> > > > > >> >>
>> > > > > >> >> In summary, IMO allowing hadoop to recover jobs independently bypassing Oozie isn't trivial. It would have helped if the JT produced notification when it comes online, so Oozie could retry after consuming those. But currently, notification only happens when task completes.
>> > > > > >> >>
>> > > > > >> >> An alternate approach is to modify the semantics of START_MANUAL. Currently Oozie puts the action/job in START_MANUAL/SUSPENDED and expects the user to resume it. We can change this and make Oozie retry the START_MANUAL actions at configurable interval (~30 mins or some scheme like exp back off). Of course, this is bad as oozie will keep polling hadoop at some interval but manual resume of jobs that have faced transient errors will no longer be mandatory.
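
The retry schedule suggested in the paragraph above (a configurable interval, or exponential back off) could look roughly like the sketch below. The helper `retry_delays` and every default in it (30-minute base, doubling, 8-hour cap, six attempts) are illustrative assumptions; none of them is an Oozie setting or API.

```python
from typing import Iterator


def retry_delays(base_minutes: int = 30, factor: int = 2,
                 cap_minutes: int = 480, attempts: int = 6) -> Iterator[int]:
    """Yield the wait, in minutes, before each retry of a START_MANUAL action.

    Hypothetical helper: the base interval, growth factor, cap, and attempt
    count are assumptions made for illustration only.
    """
    delay = base_minutes
    for _ in range(attempts):
        # Never wait longer than the cap, however many retries have passed.
        yield min(delay, cap_minutes)
        delay *= factor


schedule = list(retry_delays())
print(schedule)  # [30, 60, 120, 240, 480, 480]
```

Capping the delay keeps the polling load on Hadoop bounded while still guaranteeing that a suspended action is eventually rechecked without a manual resume.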
>> > > > > >> >>
>> > > > > >> >> --Virag
>> > > > > >> >>
>> > > > > >> >> On 8/6/13 4:38 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>> > > > > >> >>
>> > > > > >> >> >If ActionCheckX is trying to retry, and the JT recovers the job, that should be fine. The "retry" is to simply try connecting to the JT to get the status for the job. If the user issues a "RESUME" for a START_MANUAL job, then yes, Oozie will try to resubmit a new job for that action and we'd have two of them if the JT also recovers it.
>> > > > > >> >> >
>> > > > > >> >> >What if we modified the ActionStartXCommand/ResumeActionXCommand precondition to check if the action already has a Job ID that is valid (i.e. not unknown to the JT), then it fails the precondition check or something similar?
>> > > > > >> >> >
>> > > > > >> >> >- Robert
>> > > > > >> >> >
>> > > > > >> >> >On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>> > > > > >> >> >
>> > > > > >> >> >> ActionCheckX first retries for a configurable amount of time and then sets the status to START_MANUAL.
>> > > > > >> >> >> So, the problem might happen when JT recovers the job during the same time when 1) ActionCheckX is trying to retry or 2) the user issues a "RESUME" for a start_manual job.
>> > > > > >> >> >> We have to ensure that this doesn't happen, otherwise we will have two hadoop jobs for the same action.
>> > > > > >> >> >> The callback happens only when the task is completed which might be too late. During that time, Oozie might have already submitted a new hadoop job for that wf action.
>> > > > > >> >> >> So it doesn't seem straightforward to prevent Oozie from submitting a new job if the JT is already recovering the older one.
>> > > > > >> >> >>
>> > > > > >> >> >> On 8/6/13 4:01 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>> > > > > >> >> >>
>> > > > > >> >> >> >Yes, if JT recovers the job, it uses the same ID. If the JT comes up quickly and recovers the job, Oozie continues working just fine (without the ID swap issues discussed earlier). When the JT takes longer than the 10min ActionCheck interval, and the action is START_MANUAL, that still needs to be figured out.
>> > > > > >> >> >> >
>> > > > > >> >> >> >I haven't tested on Hadoop 2.x yet, but I've been told that it should have the same behavior. The only differences are that the name of the property to enable recoverability on the server (not the job-level one) is different, obviously because it doesn't have "jobtracker" in it, and it can also recover the completed tasks, which shouldn't be a problem because the launcher jar has the one task.
>> > > > > >> >> >> >I'll of course double check this though.
>> > > > > >> >> >> >
>> > > > > >> >> >> >- Robert
>> > > > > >> >> >> >
>> > > > > >> >> >> >On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy <rohini.adi...@gmail.com> wrote:
>> > > > > >> >> >> >
>> > > > > >> >> >> >> Robert,
>> > > > > >> >> >> >> You will not get an unknown hadoop job if JT has retry configured, right? What happens in that case? Especially what happens when Oozie retry happens when JT comes up quickly? Also do you know what is the behaviour with Hadoop 2.x?
>> > > > > >> >> >> >>
>> > > > > >> >> >> >> Mayank,
>> > > > > >> >> >> >> OOZIE-1231 already has the changes to show Mapreduce job id in the Child job page to be consistent with other job types. The v1 API has the older behaviour with map job url in externalId, while v2 API has it in childjobids. So there is a UI change but v1 REST API has not changed. But OOZIE-1231 has not changed any code with respect to id swap.
>> > > > > >> >> >> >>
>> > > > > >> >> >> >> Regards,
>> > > > > >> >> >> >> Rohini
>> > > > > >> >> >> >>
>> > > > > >> >> >> >> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter <rkan...@cloudera.com> wrote:
>> > > > > >> >> >> >>
>> > > > > >> >> >> >> > Ya, I saw a precondition failed message.
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > I just tried out what happens when the job is SUSPENDED, the action is START_MANUAL, and the JT recovers the hadoop job: It doesn't continue the workflow. It fails the eagerVerifyPrecondition from CompletedActionXCommand because the action isn't RUNNING. Perhaps we should make the CallbackService change the status in this situation?
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > Just to clarify, the above only happens when the JT has been down long enough that the ActionCheckXCommand (every 10min by default) + the retries (3 x 1min) happen. If it comes back sooner than that, everything works fine.
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > thanks
>> > > > > >> >> >> >> > - Robert
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari <vi...@yahoo-inc.com> wrote:
>> > > > > >> >> >> >> >
>> > > > > >> >> >> >> > > Oh..okay. Seems like RecoveryService queues the StartX command but the verifyPrecondition() fails as the wf job is Suspended (Plz verify this from logs).
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > > In that case, if Oozie is not auto-retrying and resubmitting, then it seems fair to have the JT recover the job.
>> > > > > >> >> >> >> > > But if JT recovers the job, can we make sure that the workflow job transits to RUNNING from SUSPENDED and wf action from START_MANUAL to RUNNING?
>> > > > > >> >> >> >> > > It should not happen that the user resumes the job which makes Oozie submit a new hadoop job while the JT is also recovering the same job.
>> > > > > >> >> >> >> > > Also, I think the error can still be considered transient from Oozie perspective as it is temporary depending on state of JT.
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > > Thanks,
>> > > > > >> >> >> >> > > Virag
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > > On 8/6/13 1:12 PM, "Robert Kanter" <rkan...@cloudera.com> wrote:
>> > > > > >> >> >> >> > >
>> > > > > >> >> >> >> > > >Virag,
>> > > > > >> >> >> >> > > >I just tested out killing the JT and waiting for the Checker service to retry and give up: the action goes to START_MANUAL and the job gets SUSPENDED. I waited around long enough, but the RecoveryService didn't do anything. Does it kick in for you?
>> > > > > >> >> >> >> > > >As a side note, looking at the code, the RecoveryService looks like it can handle START_MANUAL, END_MANUAL, and USER_RETRY, which all sound like things the user should be doing; is it correct that RecoveryService is handling these?
>> > > > > >> >> >> >> > > >The Unknown Hadoop Job error happens when the JT comes back in time because it won't know about the old ID if it's not recovering jobs. So, Oozie tries to ask it about a job that no longer exists. I'm not sure that this should be a transient error because there's no way to determine if it's because the JT restarted and Oozie should resubmit the job or if something else happened.
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >Mayank,
>> > > > > >> >> >> >> > > >That is a good point. We could either make a v3 API or add an oozie-site config to turn on/off the id swap behavior and keep the v2 API.
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >thanks
>> > > > > >> >> >> >> > > >- Robert
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal <may...@apache.org> wrote:
>> > > > > >> >> >> >> > > >
>> > > > > >> >> >> >> > > >> Robert,
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> That's a break in backward compatibility. Until now, users have been used to clicking on the link to go to the MR page.
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> Is there a better way to handle this?
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> Thanks,
>> > > > > >> >> >> >> > > >> Mayank
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <rkan...@cloudera.com> wrote:
>> > > > > >> >> >> >> > > >>
>> > > > > >> >> >> >> > > >> > Mona,
>> > > > > >> >> >> >> > > >> > As far as I'm aware, the "retry" that Oozie is doing is just retrying to connect to the JT (which is why when the JT comes back up, Oozie can continue monitoring the hadoop job if it still has the same ID); it doesn't try to submit the job again as part of the "retry".
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> > Mayank,
>> > > > > >> >> >> >> > > >> > We can put the ID for the actual job in the Child IDs tab (like with Pig).
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> > - Robert
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <may...@apache.org> wrote:
>> > > > > >> >> >> >> > > >> >
>> > > > > >> >> >> >> > > >> > > I agree, we should handle these two scenarios. I am ok with changing the launcher behavior for MR; however, if we remove the id swap then how do we navigate to MR jobs from the UI as we do right now?
>> > > > > >> >> >> >> > > >> > >
>> > > > > >> >> >> >> > > >> > > Thanks,
>> > > > > >> >> >> >> > > >> > > Mayank
>> > > > > >> >> >> >> > > >> > >
>> > > > > >> >> >> >> > > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter <rkan...@cloudera.com> wrote:
>> > > > > >> >> >> >> > > >> > >
>> > > > > >> >> >> >> > > >> > > > Suppose we leave the MR ID swap thing as is but set the launcher recover to 0 and job to 1; then consider these two scenarios:
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > 1. JT gets restarted during the launcher job but before the launcher job actually launches the real job:
>> > > > > >> >> >> >> > > >> > > >      - The launcher job won't be recovered because we told it not to
>> > > > > >> >> >> >> > > >> > > >      - The real job was never launched
>> > > > > >> >> >> >> > > >> > > >      ---> Action never completes and Oozie marks it as failed
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > 2. Launcher job submits the real job, but JT gets restarted before the Oozie server has a chance to swap IDs (it's not an atomic operation):
>> > > > > >> >> >> >> > > >> > > >      - The launcher job won't be recovered because we told it not to
>> > > > > >> >> >> >> > > >> > > >      - The real job will be recovered and finish successfully
>> > > > > >> >> >> >> > > >> > > >      ---> Oozie marks the action as failed even though the actual job succeeded because it didn't know about the ID swap
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > It would only work for the case where the JT gets restarted after the ID swap occurs.
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > - Robert
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <may...@apache.org> wrote:
>> > > > > >> >> >> >> > > >> > > >
>> > > > > >> >> >> >> > > >> > > > > Hi Robert,
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > +1 for oozie to set launcher to 1 and 0 to jobs for recovery in all the cases except MR.
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > After the ID is swapped, Oozie only knows about the MR job, doesn't it? Then there should not be any problem.
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > If we set MR launcher recover to 0 and job to 1 then the job will succeed in case of JT restart.
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > Am I missing something?
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > Thanks,
>> > > > > >> >> >> >> > > >> > > > > Mayank
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <rkan...@cloudera.com> wrote:
>> > > > > >> >> >> >> > > >> > > > >
>> > > > > >> >> >> >> > > >> > > > > > I think you usually just get the "Unknown Hadoop Job" error message because Oozie tries to look up the Hadoop Job ID it already has, but the JT no longer has that ID because it was restarted. With JT Recoverability turned on, it will restart the job using the same ID, so Oozie continues just fine.
>> > > > > >> >> >> >> > > >> > > > > > >> > > > > >> >> >> >> > > >> > > > > > - Robert >> > > > > >> >> >> >> > > >> > > > > > >> > > > > >> >> >> >> > > >> > > > > > >> > > > > >> >> >> >> > > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, >> Rohini >> > > > > >> >>Palaniswamy >> > > > > >> >> >> >> > > >> > > > > > <rohini.adi...@gmail.com>wrote: >> > > > > >> >> >> >> > > >> > > > > > >> > > > > >> >> >> >> > > >> > > > > > > Wouldn't oozie poll for the job >> status >> > > and >> > > > > >> >>decide >> > > > > >> >> >> >>that >> > > > > >> >> >> >> it >> > > > > >> >> >> >> > > >>has >> > > > > >> >> >> >> > > >> > > failed >> > > > > >> >> >> >> > > >> > > > > and >> > > > > >> >> >> >> > > >> > > > > > > when JT comes up launch another >> one if >> > > > > >>retry is >> > > > > >> >> >> >> > configured? >> > > > > >> >> >> >> > > >> > > > > > > >> > > > > >> >> >> >> > > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, >> Robert >> > > > > >>Kanter < >> > > > > >> >> >> >> > > >> > > rkan...@cloudera.com> >> > > > > >> >> >> >> > > >> > > > > > > wrote: >> > > > > >> >> >> >> > > >> > > > > > > >> > > > > >> >> >> >> > > >> > > > > > > > Hi, >> > > > > >> >> >> >> > > >> > > > > > > > >> > > > > >> >> >> >> > > >> > > > > > > > We looked into how to support Job >> > > > > >> >>Recoverability >> > > > > >> >> >> >>(i.e. 
>> > > > > >> >> >> >> > > >>the JT >> > > > > >> >> >> >> > > >> > is >> > > > > >> >> >> >> > > >> > > > > > > restarted >> > > > > >> >> >> >> > > >> > > > > > > > and it wants to restart the jobs >> > that >> > > > were >> > > > > >> >> >>running; >> > > > > >> >> >> >> > > >>similarly >> > > > > >> >> >> >> > > >> > for >> > > > > >> >> >> >> > > >> > > > > YARN) >> > > > > >> >> >> >> > > >> > > > > > > and >> > > > > >> >> >> >> > > >> > > > > > > > have a pretty simple solution for >> > all >> > > of >> > > > > >>the >> > > > > >> >> >>action >> > > > > >> >> >> >> > types >> > > > > >> >> >> >> > > >> > except >> > > > > >> >> >> >> > > >> > > > for >> > > > > >> >> >> >> > > >> > > > > > > > MapReduce. If we set >> > > > > >> >> >> >> mapreduce.job.restart.recover=true >> > > > > >> >> >> >> > > >>for >> > > > > >> >> >> >> > > >> > the >> > > > > >> >> >> >> > > >> > > > > > launcher >> > > > > >> >> >> >> > > >> > > > > > > > job and >> > > > > >>mapreduce.job.restart.recover=false >> > > > > >> >>for >> > > > > >> >> >>the >> > > > > >> >> >> >> jobs >> > > > > >> >> >> >> > > >> > launched >> > > > > >> >> >> >> > > >> > > > by >> > > > > >> >> >> >> > > >> > > > > > the >> > > > > >> >> >> >> > > >> > > > > > > > launcher, then when the JT >> restarts, >> > > it >> > > > > >>will >> > > > > >> >> >> >>recover >> > > > > >> >> >> >> the >> > > > > >> >> >> >> > > >> > launcher >> > > > > >> >> >> >> > > >> > > > job >> > > > > >> >> >> >> > > >> > > > > > but >> > > > > >> >> >> >> > > >> > > > > > > > not the child jobs -- the >> launcher >> > job >> > > > > >>will >> > > > > >> >>then >> > > > > >> >> >> >>take >> > > > > >> >> >> >> > > >>care of >> > > > > >> >> >> >> > > >> > > > > > relaunching >> > > > > >> >> >> >> > > >> > > > > > > > the child jobs. 
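The launcher/child split described in the paragraph above can be sketched in plain Java, with no Hadoop dependency. Only the property name mapreduce.job.restart.recover comes from the thread; the class name, the helper names, and the map-based "configuration" are hypothetical stand-ins, and the restart logic is a toy model of the idea rather than the JobTracker's actual behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RecoverySketch {
    // Property name as quoted in the thread.
    static final String RECOVER_KEY = "mapreduce.job.restart.recover";

    // Hypothetical helper: a stand-in job configuration holding only the recovery flag.
    static Map<String, String> confWithRecovery(boolean recover) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(RECOVER_KEY, Boolean.toString(recover));
        return conf;
    }

    // Toy model of a JT restart: returns the names of the jobs whose
    // configuration asks for recovery. A missing flag is treated as "true",
    // an assumption based on the thread's remark that recoverability is
    // enabled by default on all jobs.
    static List<String> jobsRecoveredAfterRestart(Map<String, Map<String, String>> runningJobs) {
        List<String> recovered = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> job : runningJobs.entrySet()) {
            String flag = job.getValue().getOrDefault(RECOVER_KEY, "true");
            if (Boolean.parseBoolean(flag)) {
                recovered.add(job.getKey());
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> running = new LinkedHashMap<>();
        running.put("launcher", confWithRecovery(true));  // launcher job: recover
        running.put("child-1", confWithRecovery(false));  // started by the launcher: do not recover
        System.out.println(jobsRecoveredAfterRestart(running));
    }
}
```

In this model only the launcher comes back after a restart; relaunching the child jobs is then the launcher's responsibility, which is the behaviour the proposal relies on.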
>>>> For MapReduce, because of the optimization with the id swap, this won't
>>>> work. It would be very tricky, if it's even practical, to do something
>>>> similar for the MR action. Instead, we think it would be best if we
>>>> simply remove the MR optimization and make it just like the other action
>>>> types. I know we normally don't want to remove optimizations, but there
>>>> are many advantages in this case, and it's only saving a single Map slot
>>>> for MR jobs only.
>>>>
>>>> I've created OOZIE-1483
>>>> <https://issues.apache.org/jira/browse/OOZIE-1483> with more details and
>>>> should have a patch soon.
>>>>
>>>> Thoughts?
>>>>
>>>> thanks
>>>> - Robert

>> > > > > >> >--
>> > > > > >> >Alejandro

>> > --
>> > Alejandro