Jonathan Eagles commented on TEZ-3897:

Do we need to worry about task deallocations implying the task needs to be 
interrupted/killed? The other schedulers will automatically deallocate a 
container if a task deallocation maps to a currently allocated task.
The deallocate task request, when processed, will issue a container being 
released message to the context, which will start the container release process.
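
A minimal sketch of that deallocate path, for illustration only. The names 
DeallocateSketch and ReleaseListener are hypothetical stand-ins for the 
scheduler and its context, and containers and tasks are plain strings here; 
this is not the code in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the scheduler context callback.
interface ReleaseListener {
  void containerBeingReleased(String containerId);
}

// Hypothetical scheduler fragment: tracks which container each task holds.
class DeallocateSketch {
  private final Map<String, String> taskToContainer = new HashMap<>();
  private final ReleaseListener listener;

  DeallocateSketch(ReleaseListener listener) {
    this.listener = listener;
  }

  void allocate(String taskId, String containerId) {
    taskToContainer.put(taskId, containerId);
  }

  // Returns true if the task held a container and a release was issued.
  boolean deallocateTask(String taskId) {
    String containerId = taskToContainer.remove(taskId);
    if (containerId == null) {
      return false; // task was not allocated, nothing to release
    }
    // Notify the context so the container release process can start.
    listener.containerBeingReleased(containerId);
    return true;
  }
}
```

Deallocating a task that was never allocated is a no-op in this sketch, so 
only a deallocation that maps to a live allocation triggers a release.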

Seems like there shouldn't be a PreemptTaskRequest so much as a 
DeallocateContainerRequest. Both of those kinds of requests don't need a 
priority, so whichever one is kept arguably shouldn't derive from TaskRequest 
but something like a SchedulerRequest that TaskRequest derives from as well. Or 
just have the queue hold Object rather than TaskRequest and do RTTI on 
everything in the queue as it already does.

Changed the class inheritance to reflect this.
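
One way the suggested hierarchy could look, as a rough sketch. Only 
SchedulerRequest, TaskRequest and DeallocateContainerRequest are names from 
the comment above; the fields, the RequestQueueSketch class and the choice of 
LinkedBlockingQueue are assumptions made for illustration and are not taken 
from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Common base type for anything the scheduler thread is asked to process.
abstract class SchedulerRequest {
}

// A request to run a task; only this kind of request carries a priority.
class TaskRequest extends SchedulerRequest {
  final String taskId;
  final int priority;

  TaskRequest(String taskId, int priority) {
    this.taskId = taskId;
    this.priority = priority;
  }
}

// A request to release a container; it has no priority of its own.
class DeallocateContainerRequest extends SchedulerRequest {
  final String containerId;

  DeallocateContainerRequest(String containerId) {
    this.containerId = containerId;
  }
}

// Hypothetical queue holder that dispatches on the runtime type of each entry.
class RequestQueueSketch {
  private final BlockingQueue<SchedulerRequest> queue = new LinkedBlockingQueue<>();

  void add(SchedulerRequest request) {
    queue.add(request);
  }

  // Drains the queue and records what the scheduler would do for each entry.
  List<String> drain() {
    List<String> actions = new ArrayList<>();
    SchedulerRequest request;
    while ((request = queue.poll()) != null) {
      if (request instanceof TaskRequest) {
        TaskRequest t = (TaskRequest) request;
        actions.add("allocate " + t.taskId + " at priority " + t.priority);
      } else if (request instanceof DeallocateContainerRequest) {
        DeallocateContainerRequest d = (DeallocateContainerRequest) request;
        actions.add("release " + d.containerId);
      }
    }
    return actions;
  }
}
```

Typing the queue as SchedulerRequest rather than Object keeps the runtime 
type checks but still lets the compiler reject unrelated objects.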

Nit: It's a bit odd for addPreemptTaskRequest's signature to return an Object 
yet it always returns null. Better as a void method?

The API expects an object or null to be returned from the deallocate container 
message. However, that's not possible since the message is actually processed in 
the dispatch thread. The caller ignores the return value as well. Changed the 
async handler so that it doesn't return null.

I didn't see in the patch where the actual preemption of the running task 
occurs. I would expect there to be a corresponding change in 
LocalContainerLauncher to implement the preempt of the running task, but it 
still explicitly ignores any requests to stop a container.
Added the container being released message for deallocate container (preemption 
case) and added logic in LocalContainerLauncher to cancel the future. Verified 
that futures are cancelled and that preemption works correctly.
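
The cancel-the-future idea can be sketched roughly as follows. 
LocalLauncherSketch and its methods are hypothetical and use only 
java.util.concurrent; the real LocalContainerLauncher code differs. The 
single-threaded executor is an assumption that mimics one local slot, so the 
second task can only run once the first one is cancelled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical launcher: tracks one future per running container.
class LocalLauncherSketch {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final Map<String, Future<?>> running = new ConcurrentHashMap<>();

  void launch(String containerId, Runnable task) {
    running.put(containerId, executor.submit(task));
  }

  // Handles a stop request; returns true if a tracked future was cancelled.
  boolean stop(String containerId) {
    Future<?> future = running.remove(containerId);
    // cancel(true) interrupts the worker thread if the task is already running.
    return future != null && future.cancel(true);
  }

  void shutdown() {
    executor.shutdownNow();
  }

  // Demo: a blocked "downstream" task is preempted so "upstream" can run.
  static boolean demo() {
    LocalLauncherSketch launcher = new LocalLauncherSketch();
    CountDownLatch downstreamStarted = new CountDownLatch(1);
    CountDownLatch neverReleased = new CountDownLatch(1);
    CountDownLatch upstreamDone = new CountDownLatch(1);
    try {
      launcher.launch("downstream", () -> {
        downstreamStarted.countDown();
        try {
          neverReleased.await(); // waits for input that never arrives
        } catch (InterruptedException e) {
          // interrupted by cancel(true): give up the slot
        }
      });
      launcher.launch("upstream", upstreamDone::countDown);
      downstreamStarted.await();
      boolean cancelled = launcher.stop("downstream");
      return cancelled && upstreamDone.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      return false;
    } finally {
      launcher.shutdown();
    }
  }
}
```

Without the stop call, the upstream task would stay queued behind the blocked 
downstream task indefinitely, which is the shape of the hang described in the 
issue.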

> Tez Local Mode hang for vertices with broadcast input
> -----------------------------------------------------
>                 Key: TEZ-3897
>                 URL: https://issues.apache.org/jira/browse/TEZ-3897
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>            Priority: Major
>         Attachments: TEZ-3897.001.patch, TEZ-3897.002.patch
>
> Broadcast edges are not taken into consideration for slow-start edges so 
> downstream tasks in local mode can start before upstream tasks. Without 
> preemption in the scheduler, there will be a hang.

