[jira] [Commented] (TEZ-2045) TaskAttemptListener should not pull Tasks from AMContainer

Siddharth Seth (JIRA) Mon, 09 Feb 2015 16:11:17 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-2045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313216#comment-14313216
 ]


Siddharth Seth commented on TEZ-2045:
-------------------------------------

bq. How can the AMContainer be in any state other than IDLE when a new task has 
been assigned to it? The scheduler will be assigning it a new task only when 
the previous task has reported completion.
Do you mean handling of C_ASSIGN_TA at various states ? That's primarily error 
checking in the Container state machine - to ensure tasks get correct errors in 
case an incorrect allocation takes place. AMContainer doesn't go into a FAILEd 
state / fail a DAG on error - which maybe it should do. IAC, both aspects don't 
seem related to this jira.
AssignTaskAttemptTransition can be invoked from LAUNCHING or IDLE state - hence 
the check in the transition. The container stays in the Launching state while 
it waits for the NMComm to send in an event indicating launch, which can come 
in after a task is assigned.

bq. Does NodeFailedBaseTransition need a call to 
container.unregisterFromTAListener()?
I don't think it needs to. The BaseTransition is used only from 'final states' 
- STOPPING/COMPLETING. Unregistration is expected to be complete before 
entering these states. NodeFailedAtIdle, Running etc take care of 
unregistration correctly - before moving the container into one of these 
'final' states.

bq. testMultipleAllocationsAtIdle() and testAllocationAtRunning() should still 
be valid but looks like they have been removed? New test 
testMultipleAllocationsWhileActive() seems to cover only 
testAllocationAtRunning().
testMultipleAllocationsAtIdle doesn't really make sense anymore since an 
allocation at IDLE moves a container into RUNNING directly - which is verified 
in testSuccessfulTaskFlow, etc.

bq. testNodeFailedAtIdle() has been replaced by testNodeFailedAtRunning()?
Yes. Same as above, since we enter RUNNING state immediately upon allocation. 
testNodeFailedAtIdleMultipleAttempts should test the idle functionality to send 
out relevant events.

bq. testContainerCompletedAtIdle() shouldn't have needed modification because 
the behavior is still the same.
There's no current attempt any more at IDLE state, so the test is changed to 
verify that nothing goes out. testContainerCompleteAtRunning covers the case 
where an actual task was allocated / running.

bq. Not sure how testGetTaskMultiplePulls() is simulating container reuse 
without actually creating another AMContainerTask and making sure both are sent 
with the expected flow. Are there any negative cases in this flow where 
multiple getTasks should not actually happen? Eg. getTask being called when the 
task has not finished yet.
It's verifying that the same task attempt is not sent to the container multiple 
times. The scenario being tested is that an attempt completes, but a new 
attempt hasn't yet been allocated to the container - in which case the old 
attempt should not be sent back. This was earlier handled in AMContainer itself 
by setting pullAttempt to null.

bq. Earlier, getTask() was essentially reaching into the AMContainer and 
exercising its state machine via C_PULL_TA. So any incorrect state would result 
in an error. But now getTask() is simply returning the AMContainerTask 
registered at some time in time without checking what the current AMContainer 
state is. How can we be sure that its not performing an inconsistent operation 
and sending a remotePull event to the TaskAttempt?
I'd spent some time thinking about this while writing the patch. A getTask 
pulling from the AMContainer is in a race with another state changing event - 
say CONTAINER_COMPLETE. Depending on which one gets processed first, the 
container may get task or will be told to die. The same race still exists - 
depending on whether the CONTAINER_COMPLETE or the getTask gets processed 
first. If the CONTAINER_COMPLETE wins, it'll unregister from the TAListener - 
and the correct shouldDie signal is sent out. I believe the behaviour is the 
same in either case.

bq. Essentially, if the running of the task is taken over by the 
TaskAttemptListener, then shouldn't the AMContainer only represent the presence 
of running executor. Which would imply that there is no need for an IDLE state. 
It could just move from LAUNCHING to RUNNING. The RUNNING state can also error 
out when another attempt is assigned while there is a current attempt already 
assigned.
That's definitely a possibility. Skip the IDLE state altogether - and just 
track the state of the CONTAINER. IDLE currently represents RUNNING_NO_TASK, 
RUNNING represents RUNNING_TASK_ASSIGNED/RUNNING. I'd prefer deferring this 
change to a follow up jira though, if we're making it.

bq. In the context of external services. Will the TaskAttemptListener expose 
AMContainerTask to external services implementing TaskAtttemptListener?
No, it won't. It will likely have to expose some aspects contained within 
AMContainerTask however, such as the extra local resources.







> TaskAttemptListener should not pull Tasks from AMContainer
> ----------------------------------------------------------
>
>                 Key: TEZ-2045
>                 URL: https://issues.apache.org/jira/browse/TEZ-2045
>             Project: Apache Tez
>          Issue Type: Task
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-2045.1.txt, TEZ-2045.2.txt
>
>
> The flow should be reversed where AMContainer registers with the 
> TaskAttemptListener, like it's done for all other operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TEZ-2045) TaskAttemptListener should not pull Tasks from AMContainer

Reply via email to