I already had yarn.resourcemanager.am.max-retries set, but the application
was still failing. Turns out the parameter is actually implemented as
yarn.resourcemanager.am.max-attempts; the Jira
https://issues.apache.org/jira/browse/YARN-611 uses the old name :)
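
For anyone hitting the same thing, here is a rough sketch of the
yarn-site.xml entry under the implemented name. The value 5 is just the
figure suggested in the quoted reply below, not something tested here. As far
as I know the default is 2, which would be consistent with the failure on
every second upgrade.

```xml
<!-- yarn-site.xml fragment (sketch; the value 5 is an assumption taken from
     the suggestion in the quoted reply, adjust for your cluster) -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>5</value>
</property>
```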

Thanks anyway!

Manoj

On Tue, Jan 5, 2016 at 3:01 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> > On 5 Jan 2016, at 00:51, Manoj Samel <manojsamelt...@gmail.com> wrote:
> >
> > Hi,
> >
> > Slider version 0.80 on a secured cluster.
> >
> > I am consistently seeing the following pattern:
> >
> > 1. Create cluster with just slider AM. Works fine
> > 2. Slider upgrade to add 1 component. Works fine
> > 3. Slider upgrade to add 2nd component. After upgrade, the application
> > goes into FAILED state.
> > I start it manually using slider start
> > 4. Add 3rd component. Works fine
> > 5. Add 4th component. Application goes to FAILED state again, and the
> > pattern repeats.
> >
> > After the addition of every 2nd component, the application fails and has
> > to be started again.
> >
> > The slider AM log does not contain any errors. It only contains the
> > following:
> >
> > 2016-01-05 00:01:42,839 [Socket Reader #1 for port 1024] INFO
> > authorize.ServiceAuthorizationManager - Authorization successful for XYZ
> > (auth:TOKEN) for protocol=interface
> > org.apache.slider.server.appmaster.rpc.SliderClusterProtocolPB
> > 2016-01-05 00:01:42,846 [IPC Server handler 1 on 1024] INFO
> > rpc.SliderIPCService - AM Suicide with signal 1, message AM restarted for
> > application upgrade delay = 1000
> > 2016-01-05 00:01:43,847 [AmExecutor-006] INFO  util.ExitUtil - Halt with
> > status 1 Message: AM restarted for application upgrade
> >
>
>
> The AM deliberately restarts itself on an upgrade, so YARN sets out the
> new values.
>
> I can see we need to add more details to the log here; I've just created
> SLIDER-1043 for that.
>
> What's triggering the failure is that your cluster is set to be fairly
> aggressive about AM failures: as soon as the AM fails more than once, your
> app is treated as failing repeatedly and is killed.
>
> 1. You can increase the failure threshold with
> yarn.resourcemanager.am.max-retries (in yarn-site.xml)
> 2. You can set a failure reset window when you launch an app, a feature of
> YARN-611
>
> Slider 0.90.2 has added support for the reset window in SLIDER-930; in
> resources.json, set the global option
> "yarn.resourcemanager.am.retry-count-window-ms" to the value you want:
>
> "yarn.resourcemanager.am.retry-count-window-ms": "300000"
>
> What slider could really do with is for YARN to have some exit code which
> we could issue to say "restart us and don't treat this as a failure".
> YARN-3417 proposes that; nobody has implemented it, and if they did, it
> wouldn't ship until Hadoop 2.9.
>
> If it's just cluster resizing you are trying to do, use the "slider flex"
> command. If you are deliberately triggering failures (via slider upgrade or
> slider am-suicide), you may want to set the reset window.
>
> Even if you haven't upgraded to 0.90.2, if you can persuade your cluster
> admins to increase the yarn.resourcemanager.am.max-retries value to
> something a bit bigger, you'll encounter fewer problems on updates and AM
> failure. I'd recommend 5 or more.
>
> -Steve
>
>
>
>
