[ https://issues.apache.org/jira/browse/TEZ-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-3028:
--------------------------------
    Description: 
There are several places where exceptions can reach the Dispatcher, which can 
cause a restart. Some of these may be valid and need to be evaluated.
For example, TaskCommunicatorManager tracks known containers. On an error it 
throws an unchecked exception, which I believe will reach the dispatcher 
directly. (Something like this happening would indicate a bug in the 
framework.) Should this trigger a restart of the AM, or a shutdown with an 
internal error?

The TaskSchedulerManager handles exceptions while processing events and 
dispatches a generic INTERNAL_ERROR to the DAGAppMaster. This can be augmented 
with the reason for the error so that diagnostics are displayed correctly (or 
at least posted to the history service).
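
A minimal sketch of what "augmented with the reason" could look like. Every 
name here (InternalErrorSketch, ErrorEvent, Receiver, processEvent) is 
hypothetical and is not taken from the Tez code base; the receiver is only a 
stand-in for whatever component consumes the error event.

```java
import java.util.ArrayList;
import java.util.List;

public class InternalErrorSketch {
    // Hypothetical event type; stands in for a generic internal-error event.
    enum EventType { INTERNAL_ERROR }

    // Hypothetical event that carries a diagnostics string next to its type.
    static final class ErrorEvent {
        final EventType type;
        final String diagnostics;

        ErrorEvent(EventType type, String diagnostics) {
            this.type = type;
            this.diagnostics = diagnostics;
        }
    }

    // Stand-in for the component that receives the error and records it.
    static final class Receiver {
        final List<String> history = new ArrayList<>();

        void handle(ErrorEvent event) {
            history.add(event.type + ": " + event.diagnostics);
        }
    }

    // Runs one unit of event processing. An unchecked exception is caught
    // here and forwarded together with its reason, instead of escaping to
    // the caller or being reported as a bare error.
    static void processEvent(Runnable work, String context, Receiver receiver) {
        try {
            work.run();
        } catch (RuntimeException ex) {
            String reason = context + " failed: "
                + ex.getClass().getSimpleName() + ": " + ex.getMessage();
            receiver.handle(new ErrorEvent(EventType.INTERNAL_ERROR, reason));
        }
    }

    public static void main(String[] args) {
        Receiver receiver = new Receiver();
        processEvent(
            () -> { throw new IllegalStateException("unknown container"); },
            "scheduler event",
            receiver);
        System.out.println(receiver.history.get(0));
    }
}
```

The point of the sketch is only that the exception class and message travel 
with the event, so whoever displays or stores diagnostics has something more 
specific than a generic internal error.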

Also, what should be done when an exception does reach the Dispatcher?

  was:
There's several places where exceptions can reach the Dispatcher - which can 
cause a restart. Some of these may be valid and need to be evaluated.
e.g. TaskCommunicatorManager tracks known containers etc. In case of an error - 
it throws an unchecked exception, which I believe will reach the dispatcher 
directly. (Something like this happening would indicate a bug in the 
framework). Should this trigger a restart of the AM - or shutting down with an 
internal error?

The TaskSchedulerManager handles exceptions while processing events and 
dispatches a generic INTERNAL_ERRROR to the DAGAppMaster. This can be augmented 
with the reason for the error so that diagnostics are displayed correctly (or 
at least posted to the history service)


> Improvements to error handling
> ------------------------------
>
>                 Key: TEZ-3028
>                 URL: https://issues.apache.org/jira/browse/TEZ-3028
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>
> There are several places where exceptions can reach the Dispatcher, which can 
> cause a restart. Some of these may be valid and need to be evaluated.
> For example, TaskCommunicatorManager tracks known containers. On an error it 
> throws an unchecked exception, which I believe will reach the dispatcher 
> directly. (Something like this happening would indicate a bug in the 
> framework.) Should this trigger a restart of the AM, or a shutdown with an 
> internal error?
> The TaskSchedulerManager handles exceptions while processing events and 
> dispatches a generic INTERNAL_ERROR to the DAGAppMaster. This can be 
> augmented with the reason for the error so that diagnostics are displayed 
> correctly (or at least posted to the history service).
> Also, what should be done when an exception does reach the Dispatcher?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
