[
https://issues.apache.org/jira/browse/TEZ-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177597#comment-14177597
]
Siddharth Seth commented on TEZ-1267:
-------------------------------------
Structurally, I think this looks good. There are some things that need to be
considered, though:
- ROUTE_EVENT_TRANSITIONS from the NEW / INITIALIZING / INITED states. These
generally will not process the relevant event, except in the case of
VertexManagerEvents. I think it's safe to leave the change as is, i.e. allow a
transition into the FAILED state, but I don't think it'll be triggered, at
least not for NEW. VertexManagers aren't set up before the NEW state, and we
likely need to put in checks for this before, as part of a separate jira.
- SOURCE_TASK_ATTEMPT_COMPLETE while in NEW / INITIALIZING / INITED - the
events will always be cached, and hence should not lead to problems. We could
choose to leave these transitions as is, or allow the FAILED state.
- Transition from INITED to RUNNING - it's possible for the VertexManager to
have scheduled tasks before generating an error. In such cases, I think we need
to try killing any invoked tasks, rather than transitioning directly to the
FAILED state. (Technically, the scheduling could happen in any of the states,
but it shouldn't be used before the vertex starts.)
- Minor: Log messages say ", vertexId=" + vertex.logIdentifier + ",". This
should just be vertex instead of vertexId, since logIdentifier contains both.
These changes are required for the EdgePlugins as well, and also for
InputInitializers. I think we should get this in, and handle the Edges and
InputInitializers in a separate jira.
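
To make the points above concrete, here is a rough sketch of the handling I
have in mind. It is a simplified, hypothetical model and not the code from the
patch: VertexRoutingSketch, UserVertexManager, failFromUserCode and the
TERMINATING state are names made up for illustration, and the real vertex state
machine has more states and events. It assumes logIdentifier already carries
both the vertex name and id.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model; these are not Tez's real classes or signatures.
class VertexRoutingSketch {

    enum State { NEW, INITIALIZING, INITED, RUNNING, TERMINATING, FAILED }

    /** Stand-in for user-supplied VertexManager code, which may throw. */
    interface UserVertexManager {
        default void onVertexStarted(VertexRoutingSketch vertex) throws Exception {}
        default void onVertexManagerEvent(String payload) throws Exception {}
        default void onSourceTaskCompleted(String attemptId) throws Exception {}
    }

    final List<String> cachedCompletions = new ArrayList<>();
    final List<String> scheduledTasks = new ArrayList<>();
    final List<String> killedTasks = new ArrayList<>();
    final List<String> log = new ArrayList<>();
    State state = State.NEW;

    private final String logIdentifier; // assumed to hold both vertex name and id
    private final UserVertexManager manager;

    VertexRoutingSketch(String logIdentifier, UserVertexManager manager) {
        this.logIdentifier = logIdentifier;
        this.manager = manager;
    }

    /** Called by the manager to schedule a task. */
    void scheduleTask(String taskId) { scheduledTasks.add(taskId); }

    /** Collapses NEW -> INITIALIZING -> INITED into one step for brevity. */
    void init() { state = State.INITED; }

    /** VertexManagerEvents can reach user code in any live state, so guard the call. */
    void routeVertexManagerEvent(String payload) {
        if (state == State.FAILED || state == State.TERMINATING) {
            return; // already shutting down; do not call user code again
        }
        try {
            manager.onVertexManagerEvent(payload);
        } catch (Exception e) {
            failFromUserCode(e);
        }
    }

    /** Before RUNNING the completion is only cached, so no user code runs. */
    void sourceTaskAttemptCompleted(String attemptId) {
        if (state != State.RUNNING) {
            cachedCompletions.add(attemptId);
            return;
        }
        try {
            manager.onSourceTaskCompleted(attemptId);
        } catch (Exception e) {
            failFromUserCode(e);
        }
    }

    /** INITED -> RUNNING: start the manager, then replay cached completions. */
    void start() {
        state = State.RUNNING;
        try {
            manager.onVertexStarted(this);
            for (String attemptId : cachedCompletions) {
                manager.onSourceTaskCompleted(attemptId);
            }
            cachedCompletions.clear();
        } catch (Exception e) {
            failFromUserCode(e);
        }
    }

    /** Fail the vertex instead of letting the exception escape and kill the AM. */
    private void failFromUserCode(Exception e) {
        // "vertex=" rather than "vertexId=", since logIdentifier contains both.
        log.add("User code failed, vertex=" + logIdentifier + ", error=" + e.getMessage());
        if (scheduledTasks.isEmpty()) {
            state = State.FAILED;
        } else {
            // Tasks were scheduled before the error: kill them first.
            killedTasks.addAll(scheduledTasks);
            scheduledTasks.clear();
            state = State.TERMINATING;
        }
    }

    public static void main(String[] args) {
        UserVertexManager faulty = new UserVertexManager() {
            @Override
            public void onVertexStarted(VertexRoutingSketch vertex) {
                vertex.scheduleTask("task_0");
                throw new IllegalStateException("bad manager");
            }
        };
        VertexRoutingSketch vertex = new VertexRoutingSketch("vertex_1 [Map 1]", faulty);
        vertex.init();
        vertex.sourceTaskAttemptCompleted("attempt_0"); // cached only
        vertex.start();
        System.out.println(vertex.state + " killed=" + vertex.killedTasks);
        System.out.println(vertex.log.get(0));
    }
}
```

In this sketch a manager that schedules a task and then throws during start
leaves the vertex in TERMINATING with the task killed, while a failure with
nothing scheduled goes straight to FAILED. Whether the real implementation
needs a separate terminating state is an open design choice.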
> Exception handling when Routing Events
> --------------------------------------
>
> Key: TEZ-1267
> URL: https://issues.apache.org/jira/browse/TEZ-1267
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Siddharth Seth
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: Tez-1267.patch
>
>
> Events are generated by user code. In some places they're also handled by
> user code within the AM. Currently, exceptions which are generated when
> handling user code will end up killing the AM (and hence leading to a retry).
> Instead, failure to handle such events should cause the application to fail.