[ 
https://issues.apache.org/jira/browse/TEZ-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149622#comment-15149622
 ] 

Siddharth Seth commented on TEZ-3029:
-------------------------------------

bq. processNonFatalServiceErrorReport does not do anything for the case where 
an error is reported for a dag that does not match the current one. Silently 
ignored? Also, should this API ever be allowed to be invoked with a null 
daginfo object? For now, both the dagInfo null and the non-matching dags can be 
logged as a warn.
The intent here was to verify the dagId reported by the plugin against the 
currently running dagId in case of 'non-fatal' errors, before taking any action 
- to avoid a race in killing / notifying an incorrect dag in case the previous 
one completed.
In case of a 'fatal' error - we don't really care about whether dagInfo is 
populated or not.
Removing dagInfo altogether is an option. For now, I'll log these cases.

bq. DAG_TERMINATED in DAGTerminationCause seems ambiguous as it implies killed 
by someone. This could be retained for killed by the user/client/yarn. For 
service plugins, why not just add a termination cause denoting service plugin 
error ( either one or 3 different errors ) to make it more clear. A similar 
change might be needed for VertexTerminationCause. A service plugin error 
should still fail the dag. It is not a kill as a kill indicates no problems 
happened and the dag was just interrupted for some reason whatsoever.
Renamed DAG_TERMINATED to SERVICE_PLUGIN_ERROR. On the Vertex itself, I think 
it's ok to leave it as KILLED - with information available at the dag level.

bq. diagnostics in DAGAppMasterEventSchedulingServiceError - is this expected 
to be a string + exception stack trace?
Yes

bq. dont like the idea of the incomplete ctor var - does this really need to be 
there? Also, can "public ContainerLauncherManager(AppContext context)" be just 
made package private for testing?
Assuming you mean ContainerLauncherManager. I don't like incomplete ctor vars 
eithers - but this makes it much simpler to test. Would prefer not changing 
this in this jira. In terms of package access, that can be done by subclassing 
for the MockDagAppMaster - which is in a different package, and hence public. 
If you feel strongly about this, I'll add a subclass to make it package access.

bq. // KKK Remove deprecated methods - maybe replace the KKK with a TODO to 
avoid any acronym confusion?
That was a reminder to me. Was thinking of making this change in this patch, 
but I think I'll defer it to just before a 0.8.3 release.

> Add an onError method to service plugin contexts
> ------------------------------------------------
>
>                 Key: TEZ-3029
>                 URL: https://issues.apache.org/jira/browse/TEZ-3029
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-3029.1.txt, TEZ-3029.2.txt, TEZ-3029.3.txt
>
>
> This is to indicate errors which may occur while running threads etc.
> One bit to be careful about is that this introduces two means of reporting 
> errors - which has been challenging to handle in the past for 
> Input/Processor/OutputContext - so whether to do this or not needs to be 
> thought through.
> TaskSchedulerContext already has this method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to