[
https://issues.apache.org/jira/browse/HADOOP-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aaron Greenhouse updated HADOOP-3557:
-------------------------------------
Attachment: JobControl.patch
Patch file that applies the changes described.
> Class JobControl needs to be rewritten for safe concurrency
> -----------------------------------------------------------
>
> Key: HADOOP-3557
> URL: https://issues.apache.org/jira/browse/HADOOP-3557
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.17.0
> Reporter: Aaron Greenhouse
> Attachments: JobControl.patch
>
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> The use of concurrency control and synchronization in JobControl is broken.
> Consider
> - The locking in addToQueue() and toArrayList() is supposed to create
> per-queue locking for the queues referenced by the fields readyJobs,
> runningJobs, successfulJobs, waitingJobs, and failedJobs. But this is
> compromised by the fact that these methods are always called when the thread
> is already synchronized on "this".
> - The run() method is written to use a busy-loop and Thread.sleep(int). But
> runnerState is not generally accessed using synchronization, nor is it
> volatile. This means that the run() is not guaranteed to ever view updates to
> runnerState. The run() method really needs to be changed to use wait(), and
> to be signaled when changes are made to runnerState or to any of the queues.
> Let's also consider the existing synchronized methods:
> addJob() -- I'm pretty sure that the synchronization in addJob is only
> necessary to protect the (indirect) access to the long field nextJobID. This
> situation could be simplified by moving the synchronization to
> getNextJobID(). As it stands, it makes the locking in addToQueue() redundant.
> allFinished() -- This is synchronized to provide an atomic snapshot of the
> state of the queues. It should be changed to use wait() and be notified
> whenever the size of one of the queues changes.
> checkRunningJobs(), checkWaitingJobs(), and startReadyJobs() -- These are all
> synchronized to prevent them from interfering with each other, themselves,
> and with addJobs(). This is too conservative. We really only need to be sure
> that each one doesn't interfere with itself (otherwise we could end up
> processing entries twice in parallel), but this is not possible based on the
> intended usage: they are are only called from the run(), so they can never be
> concurrency invoked. I don't see any danger from them running concurrently
> with addJob(), as long as the integrity of the underlying queues is
> protected. As with addJob(), the synchronization on these methods makes the
> per-queue synchronization in addToQueue redundant.
> This class should be rewritten as follows:
> * The field groupName should be made final.
> * Add a new private boolean field stateChanged that is set to true
> whenever the state of the object is changed from a public method call. This
> is checked and reset by run().
> * The per-queue synchronization needs to be scrapped: it isn't being
> exploited any way.
> * The five queue fields plus the fields runnerState, nextJobID, and the
> new stateChanged should all be protected by the lock this.
> * The run() method should be rewritten to use wait().
> * public methods that change the state of the queues or runnerState
> should set stateChanged to true and call notifyAll().
> * Ideally, allFinished() should be changed to use wait() as well. This
> would change the public interface of the class and break existing clients.
> So this method should be @Deprecated, and a new waitForAllFinished() method
> should be added.
> * The queue fields should be assigned HashMap objects instead of
> Hashtable objects. There is no advantage to using the synchronization
> provided by Hashtable.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.