[
https://issues.apache.org/jira/browse/TEZ-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194133#comment-15194133
]
Siddharth Seth commented on TEZ-3164:
-------------------------------------
Big +1 for doing this.
An external script could be used for such diagnostics, but Tez, MR, etc. will
likely already have a lot of this information from running jobs.
> Surface error histograms from the AM
> ------------------------------------
>
> Key: TEZ-3164
> URL: https://issues.apache.org/jira/browse/TEZ-3164
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Bikas Saha
>
> Job tasks are constantly probing the cluster, so if there are issues in the
> cluster, jobs would be the first to notice them. If we can surface these
> observations to the user, then we could quickly identify cluster
> issues.
> Let's say a set of bad machines got added to the cluster and tasks started
> seeing shuffle errors from those machines. This can slow down or hang the
> job. If the AM can surface increased error counts from source and
> destination machines, then that could pinpoint the bad machines, instead of
> having to arrive at those machines from first principles and log searching.
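
To make the idea in the description concrete, here is a minimal sketch of per-host error histograms. It is not Tez code and does not use any Tez API: the function names, the `min_errors` threshold, and the input shape (a list of `(source_host, destination_host)` pairs, one per observed shuffle error) are all assumptions made for illustration.

```python
from collections import Counter


def build_error_histograms(shuffle_errors):
    """Count shuffle errors per source host and per destination host.

    `shuffle_errors` is a hypothetical list of (source_host, destination_host)
    pairs, one pair per failed fetch reported by a task.
    """
    by_source = Counter()
    by_destination = Counter()
    for source_host, destination_host in shuffle_errors:
        by_source[source_host] += 1
        by_destination[destination_host] += 1
    return by_source, by_destination


def suspect_hosts(histogram, min_errors=3):
    """Return hosts with at least `min_errors` errors, most errors first.

    The threshold is an arbitrary illustrative value, not a tuned default.
    """
    return [host for host, count in histogram.most_common() if count >= min_errors]


if __name__ == "__main__":
    # Made-up observations: three fetch failures from one source host.
    errors = [("bad1", "n2"), ("bad1", "n3"), ("bad1", "n4"), ("n5", "n2")]
    by_source, by_destination = build_error_histograms(errors)
    print("suspect sources:", suspect_hosts(by_source))
    print("suspect destinations:", suspect_hosts(by_destination))
```

Keeping separate source and destination histograms matches the description's point that errors may cluster on either side of a fetch. A host that dominates one histogram is a candidate bad machine without any log searching.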