Steve Loughran created SLIDER-758:
-------------------------------------

             Summary: Slider placement requests to skip unreliable nodes
                 Key: SLIDER-758
                 URL: https://issues.apache.org/jira/browse/SLIDER-758
             Project: Slider
          Issue Type: Improvement
          Components: appmaster
    Affects Versions: Slider 0.60
            Reporter: Steve Loughran
            Assignee: Steve Loughran
             Fix For: Slider 0.70


As discussed on the developer list, Slider's "prefer previously used nodes" policy is 
biased towards recently used nodes, even when those nodes are failing to 
launch containers. 

As we already track node failure rates, the placement logic can be enhanced so 
that it does not generate "placed" requests on nodes with a recent failure 
history for that component type.
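
The filtering described above might be sketched roughly as follows. This is a minimal illustration and not Slider's actual code: the class name `PlacementFilter`, the `failureThreshold` parameter, and the nested-map layout of the failure history are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: pick nodes for "placed" requests, skipping unreliable ones. */
public class PlacementFilter {
    // node hostname -> (component type -> recent failure count); assumed structure
    private final Map<String, Map<String, Integer>> recentFailures = new HashMap<>();
    // a node is skipped once it reaches this many recent failures (assumed policy)
    private final int failureThreshold;

    public PlacementFilter(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** Record one failed container launch of a component type on a node. */
    public void recordFailure(String node, String component) {
        recentFailures.computeIfAbsent(node, n -> new HashMap<>())
                      .merge(component, 1, Integer::sum);
    }

    /** A node is reliable for a component if its failure count is below the threshold. */
    public boolean isReliable(String node, String component) {
        Map<String, Integer> perComponent = recentFailures.get(node);
        int failures = perComponent == null ? 0 : perComponent.getOrDefault(component, 0);
        return failures < failureThreshold;
    }

    /** Keep only the previously used nodes that are still reliable for this component. */
    public List<String> selectNodes(List<String> previouslyUsed, String component) {
        List<String> result = new ArrayList<>();
        for (String node : previouslyUsed) {
            if (isReliable(node, component)) {
                result.add(node);
            }
        }
        return result;
    }
}
```

Because the history is keyed by component type, a node that fails for one component in this sketch remains a candidate for placed requests of other components.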

The initial iteration of this feature will not use the YARN blacklisting APIs; 
instead it will build up history in the AM, history that will be lost on AM restart. 
Accordingly, even unplaced requests may end up being scheduled on the unreliable 
nodes.

This strategy (which we could revisit in future), combined with a regular reset 
of the failure counters, stops Slider from blacklisting nodes whose failure rate 
was high some time previously but which are now reliable again.
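
A periodic reset of the failure counters could look something like the sketch below. It is illustrative only: the class `ResettingFailureCounters`, the fixed reset interval, and the choice to pass the current time in explicitly (which keeps it easy to test with mocks) are assumptions, not Slider's implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: per-node failure counters that are cleared at a regular interval. */
public class ResettingFailureCounters {
    private final Map<String, Integer> failures = new HashMap<>();
    private final long resetIntervalMillis;
    private long lastResetMillis;

    public ResettingFailureCounters(long resetIntervalMillis, long nowMillis) {
        this.resetIntervalMillis = resetIntervalMillis;
        this.lastResetMillis = nowMillis;
    }

    /** Record a failure on a node, clearing stale history first if the interval has passed. */
    public void recordFailure(String node, long nowMillis) {
        maybeReset(nowMillis);
        failures.merge(node, 1, Integer::sum);
    }

    /** Failures seen on a node since the last reset. */
    public int failureCount(String node, long nowMillis) {
        maybeReset(nowMillis);
        return failures.getOrDefault(node, 0);
    }

    // Forget all history once a full interval has elapsed, so that a node
    // which was failing earlier is eligible for placement again.
    private void maybeReset(long nowMillis) {
        if (nowMillis - lastResetMillis >= resetIntervalMillis) {
            failures.clear();
            lastResetMillis = nowMillis;
        }
    }
}
```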

Testing: primarily via mocking



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
