Steve Loughran created SLIDER-870:
-------------------------------------
Summary: use timeline server as a historical source of failure information
Key: SLIDER-870
URL: https://issues.apache.org/jira/browse/SLIDER-870
Project: Slider
Issue Type: Sub-task
Components: appmaster, client
Affects Versions: Slider 0.80
Reporter: Steve Loughran
We lose failure history when an AM dies; this hurts reporting and prevents
the collection of long-term statistics.
We can use the timeline server for this information: save events on failure,
then query it on AM restart to rebuild that history and reuse it in decision
making.
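
A minimal sketch of the save-then-rebuild cycle described above. It does not use the real YARN timeline client: FailureStore is a hypothetical interface standing in for the timeline server, and FailureEvent, InMemoryFailureStore and FailureHistory are illustrative names, not existing Slider classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical record of one failure; not a Slider or YARN class. */
final class FailureEvent {
    final String appName;
    final String nodeId;
    final long timestampMillis;
    final String diagnostics;

    FailureEvent(String appName, String nodeId, long timestampMillis, String diagnostics) {
        this.appName = appName;
        this.nodeId = nodeId;
        this.timestampMillis = timestampMillis;
        this.diagnostics = diagnostics;
    }
}

/** Hypothetical stand-in for the timeline server: a store that outlives any single AM. */
interface FailureStore {
    void put(FailureEvent event);
    List<FailureEvent> query(String appName);
}

/** In-memory implementation, for illustration only. */
final class InMemoryFailureStore implements FailureStore {
    private final List<FailureEvent> events = new ArrayList<>();

    @Override
    public synchronized void put(FailureEvent event) {
        events.add(event);
    }

    @Override
    public synchronized List<FailureEvent> query(String appName) {
        List<FailureEvent> matches = new ArrayList<>();
        for (FailureEvent e : events) {
            if (e.appName.equals(appName)) {
                matches.add(e);
            }
        }
        return matches;
    }
}

/** Per-node failure counts held by the AM, rebuilt from the store on startup. */
final class FailureHistory {
    private final String appName;
    private final FailureStore store;
    private final Map<String, Integer> failuresByNode = new HashMap<>();

    FailureHistory(String appName, FailureStore store) {
        this.appName = appName;
        this.store = store;
        // On AM (re)start: replay saved events to recover the counts.
        for (FailureEvent e : store.query(appName)) {
            failuresByNode.merge(e.nodeId, 1, Integer::sum);
        }
    }

    /** Save the event first, then update the in-memory view. */
    void recordFailure(String nodeId, long timestampMillis, String diagnostics) {
        store.put(new FailureEvent(appName, nodeId, timestampMillis, diagnostics));
        failuresByNode.merge(nodeId, 1, Integer::sum);
    }

    int failuresOn(String nodeId) {
        return failuresByNode.getOrDefault(nodeId, 0);
    }
}
```

Constructing a second FailureHistory against the same store models an AM restart: the new instance starts with the counts the old one had recorded. A real implementation would replace InMemoryFailureStore with calls to the timeline server.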
The saved events can also be presented to the user (a) in the web UI and (b)
from the command line, even while a cluster is not running.
Finally, stats on node failures could be aggregated across applications,
possibly even across users. This would identify hotspots of node unreliability.
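
One way the cross-application aggregation might look, as a sketch only. NodeFailure and NodeHotspots are hypothetical names, the input list stands in for whatever a timeline server query would return, and the threshold for calling a node a hotspot is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical failure record carrying the user and application it came from. */
final class NodeFailure {
    final String user;
    final String appName;
    final String nodeId;

    NodeFailure(String user, String appName, String nodeId) {
        this.user = user;
        this.appName = appName;
        this.nodeId = nodeId;
    }
}

final class NodeHotspots {
    private NodeHotspots() {
    }

    /** Count failures per node, ignoring which user or application reported them. */
    static Map<String, Integer> countByNode(List<NodeFailure> failures) {
        Map<String, Integer> counts = new TreeMap<>();
        for (NodeFailure f : failures) {
            counts.merge(f.nodeId, 1, Integer::sum);
        }
        return counts;
    }

    /** Nodes with at least minFailures failures, most failures first, ties by node name. */
    static List<String> hotspots(List<NodeFailure> failures, int minFailures) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(countByNode(failures).entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (e.getValue() >= minFailures) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

Because the counts ignore user and application, a node that fails occasionally for many applications still surfaces, which a per-application history alone would miss.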