[ 
https://issues.apache.org/jira/browse/TEZ-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194133#comment-15194133
 ] 

Siddharth Seth commented on TEZ-3164:
-------------------------------------

Big +1 for doing this.
An external script could be used for such diagnostics, but Tez, MR, etc. will 
likely already have a lot of this information from running jobs.

> Surface error histograms from the AM
> ------------------------------------
>
>                 Key: TEZ-3164
>                 URL: https://issues.apache.org/jira/browse/TEZ-3164
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>
> Job tasks are constantly probing the cluster. So if there are issues in the 
> cluster, jobs will be the first to notice them. If we can surface these 
> observations to the user, then we could quickly identify cluster issues.
> Let's say a set of bad machines got added to the cluster and tasks started 
> seeing shuffle errors from those machines. This can slow down or hang the 
> job. If the AM can surface increased error counts from source and 
> destination machines, then that could pinpoint the bad machines vs. having 
> to arrive at those machines from first principles and log searching.
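
As a rough illustration of the idea in the issue description, the sketch below keeps a per-host error count and reports hosts above a threshold. It is only a minimal sketch: the class `HostErrorHistogram`, its methods, and the host names are hypothetical and are not part of the Tez API, and the issue does not say how the AM would actually collect or expose these counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper, NOT part of the Tez API: counts task-reported
// errors per host so that unusually error-prone machines stand out.
class HostErrorHistogram {
    private final Map<String, LongAdder> errorsByHost = new ConcurrentHashMap<>();

    // Record one error (for example, a shuffle fetch failure) attributed to a host.
    void recordError(String host) {
        errorsByHost.computeIfAbsent(host, h -> new LongAdder()).increment();
    }

    // Number of errors recorded so far for the given host (0 if none).
    long count(String host) {
        LongAdder adder = errorsByHost.get(host);
        return adder == null ? 0 : adder.sum();
    }

    // Hosts with at least `threshold` errors, highest count first.
    List<String> suspects(long threshold) {
        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, LongAdder> e : errorsByHost.entrySet()) {
            if (e.getValue().sum() >= threshold) {
                hosts.add(e.getKey());
            }
        }
        hosts.sort((a, b) -> Long.compare(count(b), count(a)));
        return hosts;
    }

    public static void main(String[] args) {
        HostErrorHistogram sourceErrors = new HostErrorHistogram();
        for (int i = 0; i < 5; i++) {
            sourceErrors.recordError("node-17"); // made-up host name
        }
        sourceErrors.recordError("node-03");
        System.out.println(sourceErrors.suspects(3)); // prints [node-17]
    }
}
```

One instance could be kept for errors attributed to source machines and another for destination machines, matching the two views mentioned in the description.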



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
