> On 5 Jan 2016, at 00:51, Manoj Samel <manojsamelt...@gmail.com> wrote:
> 
> Hi,
> 
> Slider version .80 on secured cluster.
> 
> I am seeing the pattern consistently
> 
> 1. Create cluster with just slider AM. Works fine
> 2. Slider upgrade to add 1 component. Works fine
> 3. Slider upgrade to add 2nd component. After upgrade, the application goes
> in FAILED state.
> I start it manually using slider start
> 4. Add 3rd component. Works fine
> 5. Add 4th component. Application goes to FAILED state again. Repeats again
> 
> After addition of every 2nd component, the application fails and has to be
> started again.
> 
> The slider AM log does not contains any error etc. It only contains
> following
> 
> 2016-01-05 00:01:42,839 [Socket Reader #1 for port 1024] INFO
> authorize.ServiceAuthorizationManager - Authorization successful for XYZ
> (auth:TOKEN) for protocol=interface
> org.apache.slider.server.appmaster.rpc.SliderClusterProtocolPB
> 2016-01-05 00:01:42,846 [IPC Server handler 1 on 1024] INFO
> rpc.SliderIPCService - AM Suicide with signal 1, message AM restarted for
> application upgrade delay = 1000
> 2016-01-05 00:01:43,847 [AmExecutor-006] INFO  util.ExitUtil - Halt with
> status 1 Message: AM restarted for application upgrade
> 


The AM deliberately restarts itself on an upgrade so that YARN relaunches it 
with the new values.

I can see we need to add more details to the log here; I've just created 
SLIDER-1043 for that.

What's triggering the failure is that your cluster is set to be fairly 
aggressive about AM failures: as soon as the AM fails more than once, your app 
is treated as failing repeatedly and is killed.
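
To make the "every second upgrade" pattern concrete, here is a minimal sketch of the counting as I understand it. It is a toy model, not YARN's actual code: the function name, the threshold of 2 and the timings are all assumptions chosen to match the report above.

```python
def simulate_restarts(restart_times_ms, max_attempts=2, window_ms=None):
    """Toy model of AM failure counting (hypothetical, not YARN's code).

    Each deliberate AM restart is recorded as one failure. When the number
    of counted failures reaches max_attempts, the application is marked
    FAILED, and the counter starts again from zero (modelling a manual
    "slider start", which launches a fresh application).
    If window_ms is set, failures older than the window are forgotten.
    """
    failures = []
    outcomes = []
    for now in restart_times_ms:
        failures.append(now)
        if window_ms is not None:
            # Keep only failures that happened inside the reset window.
            failures = [t for t in failures if now - t < window_ms]
        if len(failures) >= max_attempts:
            outcomes.append("FAILED")
            failures = []  # manual restart: a new application, fresh count
        else:
            outcomes.append("RUNNING")
    return outcomes


# Four upgrades, ten minutes apart (assumed timings).
upgrades = [0, 600_000, 1_200_000, 1_800_000]

# No reset window: every second upgrade takes the application down.
print(simulate_restarts(upgrades))

# With a five-minute reset window, each restart is counted on its own.
print(simulate_restarts(upgrades, window_ms=300_000))
```

Under these assumptions the first call alternates between RUNNING and FAILED, and the second stays RUNNING throughout, provided the upgrades are spaced further apart than the window.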

1. You can raise the failure threshold with 
yarn.resourcemanager.am.max-retries (in yarn-site.xml).
2. You can set a failure reset window when you launch an app, a feature added 
by YARN-611.

Slider 0.90.2 has added support for the reset window in SLIDER-930; in 
resources.json, set the global option  
"yarn.resourcemanager.am.retry-count-window-ms" to the value you want:

"yarn.resourcemanager.am.retry-count-window-ms": "300000"
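
For orientation, a sketch of where that line might sit in a resources.json file. This assumes the usual Slider layout with a top-level "global" section; the schema string, the empty metadata block and the slider-appmaster entry are illustrative and should be checked against your own file.

```json
{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {
    "yarn.resourcemanager.am.retry-count-window-ms": "300000"
  },
  "components": {
    "slider-appmaster": {}
  }
}
```

300000 ms is five minutes, so with this setting AM restarts spaced more than five minutes apart would not accumulate towards the failure threshold.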

What Slider could really do with is for YARN to have some exit code which we 
could issue to say "restart us and don't treat this as a failure". YARN-3417 
proposes that, but nobody has implemented it, and even if someone did, it 
wouldn't ship until Hadoop 2.9.

If it's just cluster resizing you are trying to do, use the "slider flex" 
command. If you are deliberately triggering failures (via slider upgrade or 
slider am-suicide), you may want to set the reset window.

Even if you haven't upgraded to 0.90.2, if you can persuade your cluster admins 
to raise the yarn.resourcemanager.am.max-retries value to something a bit 
bigger, you'll encounter fewer problems on updates and AM failures. I'd 
recommend 5 or more.

-Steve


