[ https://issues.apache.org/jira/browse/TEZ-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354392#comment-14354392 ]
Siddharth Seth commented on TEZ-2163:
-------------------------------------
From earlier.
bq. This issue likely only affects branch 2003, but it's good to fix in any
case.
Qualifying this some more after looking at fail/kill task handling during a
push in TEZ-2003. TA_STARTED_REMOTELY for PULL is sent out when a container
asks for work. For a PUSH, it's more intuitive to send this out after the push
is complete, which is where the issue shows up: STATUS_UPDATE, DONE, KILLED,
FAILED, any of these could arrive before TA_STARTED_REMOTELY is processed.
I think it's best to drop this issue for now (close as INVALID), and fix it in
the branch for all the states.
We could fix it by having tasks send out the START message as their first
message once they start running. Container timeouts will take care of handling
network failures. Alternatively, we could introduce another state, which can
handle the additional events from a task which hasn't yet been registered as
STARTED.
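To make the second option concrete, here is a minimal sketch in plain Java. It
is not the Tez TaskAttemptImpl and does not use the YARN StateMachineFactory;
the class name TaskAttemptSketch, the deferred queue, and the reduced state and
event enums are all hypothetical. It shows one possible reading of "handle the
additional events": START_WAIT buffers STATUS_UPDATE, DONE, FAILED and KILLED,
and replays them once TA_STARTED_REMOTELY has been processed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch only; names and structure are assumptions, not Tez code. */
final class TaskAttemptSketch {
    enum State { START_WAIT, RUNNING, SUCCEEDED, FAILED, KILLED }
    enum Event { TA_STARTED_REMOTELY, TA_STATUS_UPDATE, TA_DONE, TA_FAILED, TA_KILLED }

    private State state = State.START_WAIT;
    // Events that arrived before the attempt was registered as started.
    private final Queue<Event> deferred = new ArrayDeque<>();
    private int statusUpdates = 0;

    State getState() { return state; }
    int getStatusUpdates() { return statusUpdates; }

    void handle(Event event) {
        switch (state) {
            case START_WAIT:
                if (event == Event.TA_STARTED_REMOTELY) {
                    state = State.RUNNING;
                    // Replay early events in arrival order now that we are RUNNING.
                    Event early;
                    while ((early = deferred.poll()) != null) {
                        handle(early);
                    }
                } else {
                    // Instead of failing with an invalid transition, hold the event.
                    deferred.add(event);
                }
                break;
            case RUNNING:
                switch (event) {
                    case TA_STATUS_UPDATE: statusUpdates++; break;
                    case TA_DONE: state = State.SUCCEEDED; break;
                    case TA_FAILED: state = State.FAILED; break;
                    case TA_KILLED: state = State.KILLED; break;
                    default:
                        throw new IllegalStateException(
                            "Invalid event: " + event + " at " + state);
                }
                break;
            default:
                // Terminal states ignore further events in this sketch.
                break;
        }
    }

    public static void main(String[] args) {
        TaskAttemptSketch attempt = new TaskAttemptSketch();
        attempt.handle(Event.TA_STATUS_UPDATE);    // arrives early, gets deferred
        attempt.handle(Event.TA_STARTED_REMOTELY); // registers start, replays update
        attempt.handle(Event.TA_DONE);
        System.out.println(attempt.getState() + " after "
            + attempt.getStatusUpdates() + " status update(s)");
    }
}
```

The first option (task sends START as its first message) would avoid the
buffer entirely; the sketch above only illustrates the extra-state approach.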
> Task status update should be handled in the START_WAIT state
> ------------------------------------------------------------
>
> Key: TEZ-2163
> URL: https://issues.apache.org/jira/browse/TEZ-2163
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Siddharth Seth
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: TEZ-2163-1.patch, TEZ-2163-2.patch
>
>
> It's possible for a task to send in a STATUS_UPDATE before the
> TA_STARTED_REMOTELY message is processed within the AM.
> {code}
> 2015-02-27 13:21:15,491 ERROR [Dispatcher thread: Central] impl.TaskAttemptImpl: Can't handle this event at current state for attempt_1424502260528_0177_5_03_000223_0
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_STATUS_UPDATE at START_WAIT
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:670)
>     at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:112)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1835)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:1820)
>     at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>     at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)