> On 5 Jan 2016, at 00:51, Manoj Samel <manojsamelt...@gmail.com> wrote:
>
> Hi,
>
> Slider version .80 on secured cluster.
>
> I am seeing the pattern consistently
>
> 1. Create cluster with just slider AM. Works fine
> 2. Slider upgrade to add 1 component. Works fine
> 3. Slider upgrade to add 2nd component. After upgrade, the application goes
> in FAILED state.
> I start it manually using slider start
> 4. Add 3rd component. Works fine
> 5. Add 4th component. Application goes to FAILED state again. Repeats again
>
> After addition of every 2nd component, the application fails and has to be
> started again.
>
> The slider AM log does not contains any error etc. It only contains
> following
>
> 2016-01-05 00:01:42,839 [Socket Reader #1 for port 1024] INFO
> authorize.ServiceAuthorizationManager - Authorization successful for XYZ
> (auth:TOKEN) for protocol=interface
> org.apache.slider.server.appmaster.rpc.SliderClusterProtocolPB
> 2016-01-05 00:01:42,846 [IPC Server handler 1 on 1024] INFO
> rpc.SliderIPCService - AM Suicide with signal 1, message AM restarted for
> application upgrade delay = 1000
> 2016-01-05 00:01:43,847 [AmExecutor-006] INFO util.ExitUtil - Halt with
> status 1 Message: AM restarted for application upgrade
>

The AM deliberately restarts itself on an upgrade, so that YARN relaunches it
with the new values. I can see we need to add more details to the log here;
I've just created SLIDER-1043 for that.

What's triggering the failure is that your cluster is set to be fairly
aggressive about AM failures: as soon as the AM fails more than once, YARN
treats the app as failing repeatedly and kills it.

1. You can increase the failure threshold with
   yarn.resourcemanager.am.max-retries (in yarn-site.xml)
2. You can set a failure reset window when you launch an app, a feature
   added in YARN-611

Slider 0.90.2 has added support for the reset window in SLIDER-930; in
resources.json, set the global option
"yarn.resourcemanager.am.retry-count-window-ms" to the value you want:

    "yarn.resourcemanager.am.retry-count-window-ms": "300000"

What Slider could really do with is for YARN to have some exit code which we
could issue to say "restart us and don't treat this as a failure". YARN-3417
proposes that, but nobody has implemented it, and if they did, it wouldn't
ship until Hadoop 2.9.

If it's just cluster resizing you are trying to do, use the "slider flex"
command. If you are deliberately triggering failures (via slider upgrade or
slider am-suicide), you may want to set the reset window.

Even if you haven't upgraded to 0.90.2, if you can persuade your cluster
admins to increase the yarn.resourcemanager.am.max-retries value to something
a bit bigger, you'll encounter fewer problems on updates and AM failures. I'd
recommend 5 or more.

-Steve
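
To illustrate the retry accounting described above, here is a rough Python sketch of why every second upgrade ends in FAILED, and how either a larger limit or a reset window avoids it. It is a simplified model, not YARN's actual logic. The function name simulate_upgrades and its parameters are hypothetical. It assumes a limit of 2 attempts (the Hadoop 2.x default, where yarn-default.xml lists the property as yarn.resourcemanager.am.max-attempts), and that every upgrade-triggered AM exit is charged as one failure.

```python
def simulate_upgrades(upgrade_times_ms, max_attempts=2, window_ms=None):
    """Toy model of AM failure counting (hypothetical helper, not a YARN API).

    Each upgrade makes the AM exit with a non-zero status, which is charged
    as one failed attempt. Once the number of counted failures reaches
    max_attempts, the application is marked FAILED. If window_ms is set,
    only failures younger than the window are counted.
    """
    failures = []   # timestamps of AM exits charged to the current application
    outcomes = []
    for now in upgrade_times_ms:
        failures.append(now)
        if window_ms is not None:
            # Reset window: forget failures that fall outside the window.
            failures = [t for t in failures if now - t < window_ms]
        if len(failures) >= max_attempts:
            outcomes.append("FAILED")
            # Assumption: a manual "slider start" begins a fresh application,
            # so the failure count starts again from zero.
            failures = []
        else:
            outcomes.append("restarted")
    return outcomes


HOUR_MS = 3_600_000
upgrades = [0, HOUR_MS, 2 * HOUR_MS, 3 * HOUR_MS]  # one upgrade per hour

print(simulate_upgrades(upgrades))                     # limit of 2, no window
print(simulate_upgrades(upgrades, window_ms=300_000))  # 5 minute reset window
print(simulate_upgrades(upgrades, max_attempts=5))     # higher limit
```

With the limit at 2 and no window, the model alternates between a clean restart and FAILED, which matches the reported pattern. With a 300000 ms window or a limit of 5, all four hourly upgrades restart cleanly in this model.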