[
https://issues.apache.org/jira/browse/TEZ-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182300#comment-14182300
]
Siddharth Seth commented on TEZ-1689:
-------------------------------------
Comments on the patch.
- For all AMUserCodeExceptions thrown from edge events, it will be useful to
include the source and destination vertices which are connected on the Edge.
AMUserCodeException could have an additional diagnosticMessage field which
could be populated and logged each time.
- This is a question. Should setParallelism be throwing Exception or
AMUserCodeException ?
- edge.setCustomEdgeManager - returning false. This isn't very useful in
general, since none of the VertexManagers actually do anything with the return
value. Instead, we should just throw a RuntimeException - which would
eventually result in a VertexManager failure and will result in the DAG failing.
- Nit; {code}+ boolean successSetParallelism2; {code} Use
successSetParallelism after reset.
- DAGImpl changes - error reporting should be using getCause()
- // ingore this exception, should not happen| If this is meant to be ignored,
the log line should indicate this. Unexpected error, etc.
- {code}+ // this method would been called by all the task attempts,
+ // should always throw exception in the subsequent task attempt if
+ // exception happen in the first task attempt.{code}
Should the exception be sent to all tasks if there was routing failure for one
of them ? Instead, should we just let the single task for which there was an
error FAIL the DAG via regular state machine semantics ? That isolates the one
which caused the failure as well.
On the InputInitializer,
- For handleInputInitializerEvent - this should be fairly straightfoward to
handle. At the moment this is an inline call from within the AsyncDispatcher,
and will end up causing a RuntimeException. The RuntimeException can be changed
to a AMUserCodeException which will take care of this.
- For onVertexStateUpdated, this eventually gets invoked from within
RootInputInitializerManager. Catching exceptions there and sending a
RootInputInitialzierFailedEvent should be enough to fix this ? May require some
state machine changes to handle this event on a few more states.
I think it'll be better to do the initializer changes in a separate jira, so
that this can go in quickly (easier to review subsequent patches from the
current patch :))
> Exception handling for EdgeManagePlugin and InputInitializer
> ------------------------------------------------------------
>
> Key: TEZ-1689
> URL: https://issues.apache.org/jira/browse/TEZ-1689
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: TEZ-1689.patch
>
>
> EdgeManagePlugin and InputInitializer are both user code which could lead
> exception, we should handle it, fail the DAG and display the exception in
> client side.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)