Jason Lowe commented on TEZ-3893:
Thanks for the patch!
A lot of the fragility in this code stems from the fact that there are items in
the queue that we can process and items we cannot, and we're trying to juggle
them in the same queue. I'm wondering if this gets a lot cleaner if it is
refactored into two parts, a front-end dispatcher/handler and a fixed-size
thread pool executor to do the executions. The front-end _always_ pulls from
the queue (just FIFO, not priority). If the message is an allocate, the
dispatcher schedules the task with the fixed thread pool executor and tracks
the Future from that schedule in a map. If the message is a deallocate then it
looks up the Future from the map and cancels it, which will prevent the task
from executing if it hasn't started yet, or interrupt the thread that is
currently executing it.
After that refactoring, the queue management becomes very simple. The
dispatcher takes from the queue, always processes the message, then is ready to
take from the queue again. The fixed thread pool executor takes a task,
executes it, then is ready to take the next task if any.
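To make the suggestion concrete, here is a rough sketch of the shape I have in
mind. It is not taken from the patch or from the existing Tez scheduler code:
the DispatcherSketch and Request names, the use of a null Runnable to mean
"deallocate", and the slots parameter are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of the dispatcher/executor split; not actual Tez code. */
class DispatcherSketch implements Runnable {

  /** Hypothetical message type: an allocate carries work, a deallocate carries none. */
  static final class Request {
    final Object taskId;
    final Runnable work; // null means "deallocate" (an assumption of this sketch)

    Request(Object taskId, Runnable work) {
      this.taskId = taskId;
      this.work = work;
    }

    boolean isAllocate() {
      return work != null;
    }
  }

  // Plain FIFO queue; the dispatcher never leaves a message behind in it.
  private final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();
  // Only the dispatcher thread touches this map, so a HashMap is enough.
  private final Map<Object, Future<?>> scheduled = new HashMap<>();
  private final ExecutorService executor;

  DispatcherSketch(int slots) {
    this.executor = Executors.newFixedThreadPool(slots);
  }

  /** Called by producers; never blocks on an unbounded LinkedBlockingQueue. */
  void submit(Request request) {
    queue.add(request);
  }

  @Override
  public void run() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        Request request = queue.take(); // always take, always process
        if (request.isAllocate()) {
          // Hand the work to the pool and remember its Future.
          scheduled.put(request.taskId, executor.submit(request.work));
        } else {
          // Cancel: skips the task if it has not started, interrupts it if running.
          Future<?> future = scheduled.remove(request.taskId);
          if (future != null) {
            future.cancel(true);
          }
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
    } finally {
      executor.shutdownNow(); // stop the workers when the dispatcher exits
    }
  }
}
```

In this sketch the map entry for a task that finishes normally is never
removed, so a real version would need to clean those up (for example when the
task completes). The tasks themselves also have to respond to interruption for
the cancel of a running task to have any effect.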
> Tez Local Mode can hang for cases
> Key: TEZ-3893
> URL: https://issues.apache.org/jira/browse/TEZ-3893
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Priority: Major
> Attachments: TEZ-3893.002.patch, TEZ-3893.1.patch
> The scheduler has a race condition where events that notify can be added
> while the blocking queue is not waiting, but just before waiting. In this
> case, we can wait forever.