Bikas Saha created TEZ-3164:
-------------------------------
Summary: Surface error histograms from the AM
Key: TEZ-3164
URL: https://issues.apache.org/jira/browse/TEZ-3164
Project: Apache Tez
Issue Type: Improvement
Reporter: Bikas Saha
Job tasks constantly probe the cluster, so when something in the cluster goes
wrong, jobs are usually the first to notice. If we surface these observations
to the user, cluster issues can be identified quickly.
Let's say a set of bad machines is added to the cluster and tasks start
seeing shuffle errors from those machines. This can slow down or hang the job.
If the AM surfaces increased error counts per source and destination machine,
that could pinpoint the bad machines directly, instead of having to arrive at
them from first principles and log searching.
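
To make the idea concrete, here is a rough sketch of what such a histogram
could look like. It is written in Python for brevity and is not Tez code: the
class name, the record() and suspects() methods, and the factor and min_errors
thresholds are all hypothetical, chosen only for illustration. It counts
shuffle errors per source and per destination host and flags hosts whose count
stands well above the median.

```python
from collections import Counter


class ShuffleErrorHistogram:
    """Hypothetical AM-side tally of shuffle errors, keyed by host."""

    def __init__(self):
        self.by_source = Counter()       # host the data was fetched from
        self.by_destination = Counter()  # host that attempted the fetch

    def record(self, source_host, destination_host):
        """Count one shuffle error against both ends of the transfer."""
        self.by_source[source_host] += 1
        self.by_destination[destination_host] += 1

    def suspects(self, counts, factor=3.0, min_errors=5):
        """Return hosts with at least `min_errors` errors and at least
        `factor` times the median per-host count (assumed thresholds)."""
        if not counts:
            return []
        ordered = sorted(counts.values())
        median = ordered[len(ordered) // 2]  # upper median for even sizes
        return sorted(
            host
            for host, n in counts.items()
            if n >= min_errors and n >= factor * median
        )


hist = ShuffleErrorHistogram()
# One bad source host fails fetches from several destinations.
for i in range(12):
    hist.record("bad1", f"node{i % 4}")
# Background noise: a single error on each healthy source host.
for i in range(5):
    hist.record(f"node{i}", f"node{(i + 1) % 5}")

print(hist.suspects(hist.by_source))
print(hist.suspects(hist.by_destination))
```

Keeping separate source and destination counts matters because a bad source
shows up as many errors concentrated on one source host, while its failures
are spread thinly across the destinations that tried to fetch from it.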