Class JobControl needs to be rewritten for safe concurrency
-----------------------------------------------------------
Key: HADOOP-3557
URL: https://issues.apache.org/jira/browse/HADOOP-3557
Project: Hadoop Core
Issue Type: Bug
Components: mapred
Affects Versions: 0.17.0
Reporter: Aaron Greenhouse
Attachments: JobControl.patch
The use of concurrency control and synchronization in JobControl is broken.
Consider
- The locking in addToQueue() and toArrayList() is supposed to create per-queue
locking for the queues referenced by the fields readyJobs, runningJobs,
successfulJobs, waitingJobs, and failedJobs. But this is compromised by the
fact that these methods are always called when the thread is already
synchronized on "this".
- The run() method is written to use a busy-loop and Thread.sleep(int). But
runnerState is not generally accessed using synchronization, nor is it
volatile. This means that the run() is not guaranteed to ever view updates to
runnerState. The run() method really needs to be changed to use wait(), and to
be signaled when changes are made to runnerState or to any of the queues.
Let's also consider the existing synchronized methods:
addJob() -- I'm pretty sure that the synchronization in addJob is only
necessary to protect the (indirect) access to the long field nextJobID. This
situation could be simplified by moving the synchronization to getNextJobID().
As it stands, it makes the locking in addToQueue() redundant.
allFinished() -- This is synchronized to provide an atomic snapshot of the
state of the queues. It should be changed to use wait() and be notified
whenever the size of one of the queues changes.
checkRunningJobs(), checkWaitingJobs(), and startReadyJobs() -- These are all
synchronized to prevent them from interfering with each other, themselves, and
with addJobs(). This is too conservative. We really only need to be sure that
each one doesn't interfere with itself (otherwise we could end up processing
entries twice in parallel), but this is not possible based on the intended
usage: they are are only called from the run(), so they can never be
concurrency invoked. I don't see any danger from them running concurrently with
addJob(), as long as the integrity of the underlying queues is protected. As
with addJob(), the synchronization on these methods makes the per-queue
synchronization in addToQueue redundant.
This class should be rewritten as follows:
* The field groupName should be made final.
* Add a new private boolean field stateChanged that is set to true whenever
the state of the object is changed from a public method call. This is checked
and reset by run().
* The per-queue synchronization needs to be scrapped: it isn't being
exploited any way.
* The five queue fields plus the fields runnerState, nextJobID, and the new
stateChanged should all be protected by the lock this.
* The run() method should be rewritten to use wait().
* public methods that change the state of the queues or runnerState should
set stateChanged to true and call notifyAll().
* Ideally, allFinished() should be changed to use wait() as well. This
would change the public interface of the class and break existing clients. So
this method should be @Deprecated, and a new waitForAllFinished() method should
be added.
* The queue fields should be assigned HashMap objects instead of Hashtable
objects. There is no advantage to using the synchronization provided by
Hashtable.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.