David,

When local disks on the host running the NodeManager are more than 90% full, the NodeManager marks them as bad and reports a message like "10/12 local-dirs are bad:". In such cases, the NodeManager service keeps running but is not servicing any applications.
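A minimal sketch of one way to check this on the affected host, assuming Python 3 is available there. The 90% threshold is taken from the figure above (it matches the default of `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage`, but your cluster may override it), and the `/grid/0` through `/grid/11` directory list is a guess based on the paths in the health report, so adjust both to your setup. The function names are hypothetical, and the calculation only approximates what the NodeManager's own disk checker does.

```python
import shutil

# Assumption: 90% is the threshold in effect on this cluster.
MAX_UTILIZATION_PCT = 90.0


def utilization_pct(used, total):
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def find_full_dirs(dirs, threshold=MAX_UTILIZATION_PCT, usage_fn=shutil.disk_usage):
    """Return (path, pct) for every dir above the threshold.

    Directories that cannot be read are returned with pct=None so they
    are not silently skipped.
    """
    flagged = []
    for path in dirs:
        try:
            usage = usage_fn(path)
        except OSError:
            flagged.append((path, None))
            continue
        pct = utilization_pct(usage.used, usage.total)
        if pct > threshold:
            flagged.append((path, pct))
    return flagged


if __name__ == "__main__":
    # Assumption: 12 local dirs named like the ones in the health report.
    local_dirs = [f"/grid/{i}/hadoop/yarn/local" for i in range(12)]
    for path, pct in find_full_dirs(local_dirs):
        if pct is None:
            print(f"{path}: could not be checked")
        else:
            print(f"{path}: {pct:.1f}% used")
```

Running the same check against the `yarn/log` directories would cover the "log-dirs are bad" half of the report.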
Check if the host has multiple disks that are more than 90% full.

Hope this helps!

Manoj

On Tue, Apr 3, 2018 at 10:59 PM, Gour Saha <gs...@hortonworks.com> wrote:

> Can you check the slider agent logs and the application logs in those
> containers to see if they are failing with some exception?
>
> The fishy thing I found in the AM log are messages like these saying
> "local-dirs are bad". Can you check what's going on with these dirs?
>
> 2018-04-03 18:38:28,200 [AMRM Callback Handler Thread] INFO
> appmaster.SliderAppMaster - onNodesUpdated(1)
> 2018-04-03 18:38:28,376 [AMRM Callback Handler Thread] INFO
> appmaster.SliderAppMaster - Updated nodes [nodeId { host: "***" port: 45454
> } httpAddress: "***:8042" rackName: "/EI105" used { memory: 0
> virtual_cores: 0 } capability { memory: 364544 virtual_cores: 38 }
> node_state: NS_UNHEALTHY health_report: "10/12 local-dirs are bad:
> /grid/9/hadoop/yarn/local,/grid/2/hadoop/yarn/local,
> /grid/1/hadoop/yarn/local,/grid/5/hadoop/yarn/local,
> /grid/11/hadoop/yarn/local,/grid/3/hadoop/yarn/local,
> /grid/8/hadoop/yarn/local,/grid/6/hadoop/yarn/local,
> /grid/0/hadoop/yarn/local,/grid/7/hadoop/yarn/local; 10/12 log-dirs are
> bad: /grid/6/hadoop/yarn/log,/grid/8/hadoop/yarn/log,
> /grid/2/hadoop/yarn/log,/grid/1/hadoop/yarn/log,/grid/5/hadoop/yarn/log,
> /grid/11/hadoop/yarn/log,/grid/7/hadoop/yarn/log,/grid/9/hadoop/yarn/log,
> /grid/0/hadoop/yarn/log,/grid/3/hadoop/yarn/log" last_health_report_time:
> 1522798707678]
>
> -Gour
>
> On 4/3/18, 10:49 PM, "David.Serafini" <david.seraf...@target.com> wrote:
>
> I've attached what I can find.
>
> On 4/3/18, 10:38 PM, Gour Saha <gs...@hortonworks.com> wrote:
>
> Can you share the logs of the dying containers and the AM to debug
> further?
>
> -Gour
>
> On 4/3/18, 6:49 PM, "David.Serafini" <david.seraf...@target.com> wrote:
>
> I've been using slider 0.91 for a year and it's been very stable lately.
> I built 0.92 to test it and my yarn containers are dying after 10 minutes.
> Slider restarts them successfully, but this isn't acceptable behavior.
> Any thoughts on what could be going on?
>
> I looked for some kind of release notes for 0.92, but didn't
> find anything except a list of ticket ids.
> Is there some configuration in my job that I should have
> changed to use 0.92?
>
> Thanks,
> -david