[
https://issues.apache.org/jira/browse/SLIDER-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289050#comment-14289050
]
Steve Loughran commented on SLIDER-764:
---------------------------------------
That log looks like it's handling the node status events from the RM, which
does include a node health check. In that situation, we don't currently act on
it other than updating our node map of failed nodes. We'll get the real
container events from the RM when it decides the containers have actually failed.
What you are really looking for is some information when the *application*
doesn't consider itself live.
Which raises the question: how do you define liveness?
We could start with some criteria for the minimum number of each component type
(e.g. 1 x master, 1 x region server, 1 x thrift, 0 x REST) and add that to
resources.json.
That would suffice to at least say the system isn't up when the thresholds
aren't met.
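A rough sketch of that threshold check, in Python purely for illustration. The
component names and the idea of a per-component minimum are assumptions here;
resources.json has no such field today, and the helper names are hypothetical.

```python
# Hypothetical sketch: per-component minimums are NOT an existing
# resources.json field; names and values below are illustrative only.

def unmet_thresholds(minimums, live_counts):
    """Return {component: (live, required)} for each component below its minimum."""
    return {
        name: (live_counts.get(name, 0), required)
        for name, required in minimums.items()
        if live_counts.get(name, 0) < required
    }


def is_application_live(minimums, live_counts):
    """The application counts as live only when every minimum is met."""
    return not unmet_thresholds(minimums, live_counts)


# Minimums as suggested above: 1 x master, 1 x region server, 1 x thrift, 0 x REST.
minimums = {"master": 1, "regionserver": 1, "thrift": 1, "rest": 0}

# Example state: no region server is running.
live = {"master": 1, "regionserver": 0, "thrift": 1}

print(unmet_thresholds(minimums, live))
print(is_application_live(minimums, live))
```

A component with a minimum of 0 never blocks liveness, so optional services can
be listed without affecting the result.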
This could be followed up with proper liveness tests to verify that deployed
services are actually opening ports, responding to HTTP requests, etc.
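As a minimal sketch of what such probes might look like (Python standard
library only; the function names, timeouts and the "2xx/3xx means healthy" rule
are assumptions, not anything Slider implements):

```python
import socket
import urllib.error
import urllib.request


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts and timeouts.
        return False


def http_responds(url, timeout=2.0):
    """Return True if an HTTP GET on url yields a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP error statuses, network failures and malformed URLs all count as "not live".
        return False
```

An open port only shows that something is listening; the HTTP check is the
stronger signal that the service is actually answering requests.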
-steve
> Provide proper response when number of good nodes is lower than requested
> number of components
> ----------------------------------------------------------------------------------------------
>
> Key: SLIDER-764
> URL: https://issues.apache.org/jira/browse/SLIDER-764
> Project: Slider
> Issue Type: Task
> Reporter: Ted Yu
>
> While debugging a Slider-hbase deployment problem where client retrieved
> hbase-site.xml but verification of region server count failed, I found the
> following in SliderAppMaster log:
> {code}
> 2015-01-19 11:19:57,318 [AMRM Callback Handler Thread] INFO
> appmaster.SliderAppMaster (SliderAppMaster.java:onNodesUpdated(1603)) -
> Updated nodes [nodeId { host:
> "os-h2-2210-d6-sec-1421653828-hbase-slider3-1.hw.local" port: 45454 }
> httpAddress: "os-h2-2210-d6-sec-1421653828-hbase-slider3-1.hw.local:8044"
> rackName: "/default-rack" used { memory: 0 virtual_cores: 0 }
> capability { memory: 10240 virtual_cores: 8 } node_state: NS_UNHEALTHY
> health_report: "2/2 local-dirs are bad:
> /grid/0/yarn/local,/grid/1/yarn/local; 2/2 log-dirs are bad:
> /grid/0/yarn/log,/grid/1/yarn/log" last_health_report_time: 1421666370462]
> {code}
> If there are not enough good nodes on which the requested number of components
> (such as region servers) can be deployed, Slider shouldn't signal deployment
> success.