I already had yarn.resourcemanager.am.max-retries set but it was still failing. Turns out the parameter is actually implemented as yarn.resourcemanager.am.max-attempts - the Jira https://issues.apache.org/jira/browse/YARN-611 still has the old name :)
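
For anyone finding this thread later, the property would look roughly like this in yarn-site.xml. Treat it as a sketch: the value 5 just follows the recommendation quoted below, and the right number depends on the cluster. As far as I know the YARN default is 2, which would match the failure on every second upgrade.

```xml
<!-- yarn-site.xml fragment (sketch): goes inside the <configuration> element -->
<property>
  <!-- Implemented name; the older "max-retries" name did not take effect here -->
  <name>yarn.resourcemanager.am.max-attempts</name>
  <!-- Illustrative value: the quoted advice below is "5 or more" -->
  <value>5</value>
</property>
```

This is a ResourceManager-side setting, so it most likely needs a cluster admin to apply it and a ResourceManager restart to take effect.
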
Thanks anyway!

Manoj

On Tue, Jan 5, 2016 at 3:01 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> > On 5 Jan 2016, at 00:51, Manoj Samel <manojsamelt...@gmail.com> wrote:
> >
> > Hi,
> >
> > Slider version .80 on secured cluster.
> >
> > I am seeing the pattern consistently
> >
> > 1. Create cluster with just slider AM. Works fine
> > 2. Slider upgrade to add 1 component. Works fine
> > 3. Slider upgrade to add 2nd component. After upgrade, the application goes
> >    in FAILED state.
> >    I start it manually using slider start
> > 4. Add 3rd component. Works fine
> > 5. Add 4th component. Application goes to FAILED state again. Repeats again
> >
> > After addition of every 2nd component, the application fails and has to be
> > started again.
> >
> > The slider AM log does not contains any error etc. It only contains
> > following
> >
> > 2016-01-05 00:01:42,839 [Socket Reader #1 for port 1024] INFO authorize.ServiceAuthorizationManager - Authorization successful for XYZ (auth:TOKEN) for protocol=interface org.apache.slider.server.appmaster.rpc.SliderClusterProtocolPB
> > 2016-01-05 00:01:42,846 [IPC Server handler 1 on 1024] INFO rpc.SliderIPCService - AM Suicide with signal 1, message AM restarted for application upgrade delay = 1000
> > 2016-01-05 00:01:43,847 [AmExecutor-006] INFO util.ExitUtil - Halt with status 1 Message: AM restarted for application upgrade
> >
>
> The AM deliberately restarts itself on an upgrade, so YARN sets out the
> new values.
>
> I can see we need to add more details to the log here -I've just created
> SLIDER-1043 for that.
>
> What's triggering the failure is that your cluster is set to be fairly
> aggressive about AM failures, as soon as you fail more than once, your app
> is treated as failing repeatedly and killing it.
>
> 1. You can increment the failure threshold with
>    yarn.resourcemanager.am.max-retries (in yarn-site.xml)
> 2. you can set a failure reset window when you launch an app, a feature of
>    YARN-611
>
> Slider 0.90.2 has added support for the reset window in SLIDER-930; in
> resources.json, set the global option
> "yarn.resourcemanager.am.retry-count-window-ms" to the value you want:
>
>     "yarn.resourcemanager.am.retry-count-window-ms": "300000"
>
> What slider could really do with is for YARN to have some exit code which
> we could issue to say "restart us and don't treat this as a failure".
> YARN-3417 proposes that - nobody has implemented it, and if they did, it
> wouldn't ship until Hadoop 2.9
>
> If its just cluster resizing you are trying to do, use the "slider flex"
> command. If you are deliberately triggering failures (via slider upgrade or
> slider am-suicide), you may want to set the reset window.
>
> Even if you havent upgraded to 0.90.2, if you can persuade your cluster
> admins to increment the yarn.resourcemanager.am.max-retries value to
> something a bit bigger, you'll encounter less problems on updates and AM
> failure. I'd recommend 5 or more.
>
> -Steve