[ https://issues.apache.org/jira/browse/SLIDER-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16184609#comment-16184609 ]
Gour Saha commented on SLIDER-1246:
-----------------------------------
A sample resources JSON for Hive:
{code}
{
"schema" : "http://example.org/specification/v2.0.0",
"metadata" : { },
"global" : {
"yarn.log.include.patterns" : ".*\\.done"
},
"credentials" : { },
"components" : {
"LLAP" : {
"yarn.role.priority" : "1",
"yarn.component.instances" : "5",
"yarn.memory" : "10240",
"yarn.component.placement.policy" : "0",
"yarn.resource.normalization.enabled" : "false",
"yarn.container.health.threshold.percent" : "80", // 80%
"yarn.container.health.threshold.window.secs" : "600", // acceptable to be below 80% for up to 10 mins at a stretch
"yarn.container.health.threshold.init.delay.secs" : "400" // additional lead time of 400 secs before the threshold monitor kicks in to do its job
},
"slider-appmaster" : {
"yarn.vcores" : "1",
"yarn.component.instances" : "1",
"yarn.memory" : "1024"
}
}
}
{code}
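To illustrate how the three threshold properties in the sample could interact, here is a minimal Python sketch. It is not Slider's implementation: the class name HealthThresholdMonitor, the check method, and the boundary behaviour (for example, that exactly 600 secs below the threshold still counts as acceptable) are assumptions drawn only from the inline comments in the sample.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthThresholdMonitor:
    """Hypothetical model of the three threshold properties; not Slider's code."""

    threshold_percent: float  # yarn.container.health.threshold.percent
    window_secs: float        # yarn.container.health.threshold.window.secs
    init_delay_secs: float    # yarn.container.health.threshold.init.delay.secs
    below_since: Optional[float] = None  # start of the current below-threshold stretch

    def check(self, now_secs: float, healthy: int, desired: int) -> bool:
        """Return True while the component is still considered healthy.

        now_secs is the number of seconds since the application started
        (an assumption; the real reference point may differ).
        """
        if now_secs < self.init_delay_secs:
            return True  # lead time: the monitor has not kicked in yet
        percent = 100.0 * healthy / desired if desired > 0 else 100.0
        if percent >= self.threshold_percent:
            self.below_since = None  # recovered, so reset the stretch
            return True
        if self.below_since is None:
            self.below_since = now_secs  # a new below-threshold stretch begins
        # Tolerate being below the threshold for at most window_secs at a stretch.
        return now_secs - self.below_since <= self.window_secs
```

With the LLAP values from the sample (80, 600, 400) and 5 desired instances, a component that has only 3 healthy containers from second 400 onward would be tolerated through second 1000 and flagged after that, unless it climbs back to 4 or more healthy containers first.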
> Application health should not be affected by faulty nodes
> ---------------------------------------------------------
>
> Key: SLIDER-1246
> URL: https://issues.apache.org/jira/browse/SLIDER-1246
> Project: Slider
> Issue Type: Bug
> Components: appmaster, core
> Affects Versions: Slider 0.92
> Reporter: Prasanth Jayachandran
> Assignee: Gour Saha
> Fix For: Slider 1.0.0
>
> Attachments: SLIDER-1246.01.patch
>
>
> When a node is faulty, multiple container failures on it are treated as an
> application failure.
> Observed this in HIVE-16927, where container failures on certain nodes bring
> down the entire application. Slider has to provide a way to not mark the
> application as unhealthy while a certain threshold of containers is running.
> Tuning the failure threshold is not optimal, as setting the correct default
> on a large cluster is not trivial. Beyond a certain number of failures,
> Slider should mark the node as unhealthy and report that back to the
> client/AM. The application could continue to run as long as the container
> request is partially satisfied (example: 80% of containers are running).