Steve Loughran created SLIDER-870:
-------------------------------------

             Summary: use timeline server as a historical source of failure information
                 Key: SLIDER-870
                 URL: https://issues.apache.org/jira/browse/SLIDER-870
             Project: Slider
          Issue Type: Sub-task
          Components: appmaster, client
    Affects Versions: Slider 0.80
            Reporter: Steve Loughran


We lose failure history when an AM dies; this hurts reporting and prevents 
the collection of long-term statistics.

We can use the timeline server for this information, saving events on failure, 
then querying it on AM restart to rebuild that history & re-use it in decision 
making. 
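
A minimal sketch of the save-then-replay idea, not Slider code. FailureStore is a hypothetical stand-in for the timeline server, and the in-memory implementation exists only so the sketch runs without a cluster; every class, field and method name here is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical failure record; the field names are illustrative only. */
final class FailureEvent {
    final String role;
    final String host;
    final long timestamp;

    FailureEvent(String role, String host, long timestamp) {
        this.role = role;
        this.host = host;
        this.timestamp = timestamp;
    }
}

/** Hypothetical stand-in for the timeline server: a durable, queryable event log. */
interface FailureStore {
    void save(String appName, FailureEvent event);

    List<FailureEvent> query(String appName);
}

/** In-memory store, used only so the sketch runs without a cluster. */
class InMemoryFailureStore implements FailureStore {
    private final Map<String, List<FailureEvent>> events = new HashMap<>();

    @Override
    public void save(String appName, FailureEvent event) {
        events.computeIfAbsent(appName, k -> new ArrayList<>()).add(event);
    }

    @Override
    public List<FailureEvent> query(String appName) {
        List<FailureEvent> saved = events.get(appName);
        // Return a copy so callers cannot modify the stored history.
        return saved == null ? new ArrayList<>() : new ArrayList<>(saved);
    }
}

/** Failure history held by one AM instance, rebuilt from the store on startup. */
class FailureHistory {
    private final String appName;
    private final FailureStore store;
    private final Map<String, Integer> failuresByHost = new HashMap<>();

    FailureHistory(String appName, FailureStore store) {
        this.appName = appName;
        this.store = store;
        // On (re)start, replay previously saved events to rebuild the counters.
        for (FailureEvent e : store.query(appName)) {
            failuresByHost.merge(e.host, 1, Integer::sum);
        }
    }

    /** Persist the event first, then update the in-memory view. */
    void recordFailure(FailureEvent event) {
        store.save(appName, event);
        failuresByHost.merge(event.host, 1, Integer::sum);
    }

    /** Number of failures seen on a host, including those from earlier AM instances. */
    int failuresOn(String host) {
        return failuresByHost.getOrDefault(host, 0);
    }
}
```

A new FailureHistory built against the same store after an AM failure sees the failures recorded by its predecessor, which is the property the placement decisions would rely on.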

The saved failure events can also be presented to the user in (a) the web UI 
and (b) from the command line, even while a cluster is not running.

Finally, stats on node failures could be aggregated across applications, 
possibly even across users. This would identify hotspots for node unreliability.
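
A rough sketch of what that aggregation could look like, assuming the failed host names have already been fetched per application. NodeFailureAggregator, its methods and the threshold parameter are all hypothetical names used for illustration, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical aggregator over per-application lists of failed host names. */
class NodeFailureAggregator {

    /** Count failures per host across all applications, sorted by host name. */
    static Map<String, Integer> countByHost(Map<String, List<String>> failedHostsByApp) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> hosts : failedHostsByApp.values()) {
            for (String host : hosts) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Hosts whose total failure count is at least the given threshold. */
    static List<String> hotspots(Map<String, List<String>> failedHostsByApp, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : countByHost(failedHostsByApp).entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A host that fails once in each of several applications looks healthy to any single AM, but stands out once the counts are combined.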




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
