[
https://issues.apache.org/jira/browse/TEZ-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149622#comment-15149622
]
Siddharth Seth commented on TEZ-3029:
-------------------------------------
bq. processNonFatalServiceErrorReport does not do anything for the case where
an error is reported for a dag that does not match the current one. Silently
ignored? Also, should this API ever be allowed to be invoked with a null
daginfo object? For now, both the dagInfo null and the non-matching dags can be
logged as a warn.
The intent here was to verify the dagId reported by the plugin against the
currently running dagId in case of 'non-fatal' errors, before taking any action
- to avoid a race in killing / notifying an incorrect dag in case the previous
one completed.
In case of a 'fatal' error - we don't really care about whether dagInfo is
populated or not.
Removing dagInfo altogether is an option. For now, I'll log these cases.
bq. DAG_TERMINATED in DAGTerminationCause seems ambiguous as it implies killed
by someone. This could be retained for killed by the user/client/yarn. For
service plugins, why not just add a termination cause denoting service plugin
error ( either one or 3 different errors ) to make it more clear. A similar
change might be needed for VertexTerminationCause. A service plugin error
should still fail the dag. It is not a kill as a kill indicates no problems
happened and the dag was just interrupted for some reason whatsoever.
Renamed DAG_TERMINATED to SERVICE_PLUGIN_ERROR. On the Vertex itself, I think
it's ok to leave it as KILLED - with information available at the dag level.
bq. diagnostics in DAGAppMasterEventSchedulingServiceError - is this expected
to be a string + exception stack trace?
Yes
bq. dont like the idea of the incomplete ctor var - does this really need to be
there? Also, can "public ContainerLauncherManager(AppContext context)" be just
made package private for testing?
Assuming you mean ContainerLauncherManager. I don't like incomplete ctor vars
eithers - but this makes it much simpler to test. Would prefer not changing
this in this jira. In terms of package access, that can be done by subclassing
for the MockDagAppMaster - which is in a different package, and hence public.
If you feel strongly about this, I'll add a subclass to make it package access.
bq. // KKK Remove deprecated methods - maybe replace the KKK with a TODO to
avoid any acronym confusion?
That was a reminder to me. Was thinking of making this change in this patch,
but I think I'll defer it to just before a 0.8.3 release.
> Add an onError method to service plugin contexts
> ------------------------------------------------
>
> Key: TEZ-3029
> URL: https://issues.apache.org/jira/browse/TEZ-3029
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Siddharth Seth
> Assignee: Siddharth Seth
> Attachments: TEZ-3029.1.txt, TEZ-3029.2.txt, TEZ-3029.3.txt
>
>
> This is to indicate errors which may occur while running threads etc.
> One bit to be careful about is that this introduces two means of reporting
> errors - which has been challenging to handle in the past for
> Input/Processor/OutputContext - so whether to do this or not needs to be
> thought through.
> TaskSchedulerContext already has this method.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)