[
https://issues.apache.org/jira/browse/MAPREDUCE-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738226#action_12738226
]
Sreekanth Ramakrishnan commented on MAPREDUCE-802:
--------------------------------------------------
Currently, the problems that arise in systems relying on job events fall into
two categories:
# Not all code paths raise status change events. The reason is that the state
change is performed in {{JobInProgress}}, which has no handle to the list of
{{JobInProgressListener}} instances managed by the {{JobTracker}}. So the
components that need the state change in order to remove or update their
internal structures of {{JobInProgress}} objects are left out of sync.
# Listeners rely on the {{oldStatus}} field, and on the members of that
structure, being set correctly by the {{JobTracker}} before the listeners are
called. A notable example is the start time change described in MAPREDUCE-45.
To solve the problems listed above, the proposal is as follows:
* To solve case 1, whenever {{JobInProgress}} changes its state, we route the
associated event to the {{JobTracker}}. This ensures that any code path which
changes the {{JobStatus}} actually results in events being raised.
* To solve case 2, we remove the {{oldStatus}} field from
{{JobStatusChangeEvent}}, as it is not always correct. This is an incompatible
change: the old status is currently used by two schedulers, in
{{JobQueueJobInProgressListener}} for the default scheduler and in
{{JobQueueManager}} for the capacity scheduler. Both schedulers would now have
to maintain their own link from the old status to the {{JobInProgress}}.
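As a rough illustration of case 1, here is a minimal sketch of routing every state change through a single point that forwards to the tracker. All names here ({{SketchJob}}, {{SketchTracker}}, {{JobChangeListener}}, {{RunState}}) are hypothetical stand-ins and not the real Hadoop classes; the only idea taken from the proposal is that the job object itself hands the event to the tracker, which owns the listener list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins; these are NOT the real Hadoop classes.
enum RunState { PREP, RUNNING, KILLED }

// Stand-in for JobInProgressListener: receives only the changed job,
// with no old status attached.
interface JobChangeListener {
    void jobUpdated(SketchJob job);
}

// Stand-in for JobTracker: owns the listener list and fans events out.
class SketchTracker {
    private final List<JobChangeListener> listeners = new ArrayList<>();

    void addListener(JobChangeListener listener) {
        listeners.add(listener);
    }

    // Single entry point for all job change notifications.
    void jobChanged(SketchJob job) {
        for (JobChangeListener listener : listeners) {
            listener.jobUpdated(job);
        }
    }
}

// Stand-in for JobInProgress: every state change goes through setState(),
// so a new code path (e.g. another way to kill a job) cannot skip the event.
class SketchJob {
    private final String id;
    private final SketchTracker tracker;
    private RunState state = RunState.PREP;

    SketchJob(String id, SketchTracker tracker) {
        this.id = id;
        this.tracker = tracker;
    }

    String getId() { return id; }
    RunState getState() { return state; }

    void start() { setState(RunState.RUNNING); }
    void kill() { setState(RunState.KILLED); }

    private void setState(RunState newState) {
        if (state == newState) {
            return; // no change, so no event
        }
        state = newState;
        tracker.jobChanged(this); // route the event to the tracker
    }
}

class ListenerRoutingDemo {
    public static void main(String[] args) {
        SketchTracker tracker = new SketchTracker();
        List<String> seen = new ArrayList<>();
        tracker.addListener(job -> seen.add(job.getId() + ":" + job.getState()));

        SketchJob job = new SketchJob("job_1", tracker);
        job.start();
        job.kill();
        System.out.println(seen); // prints [job_1:RUNNING, job_1:KILLED]
    }
}
```

The point of the sketch is only that the state mutation and the notification live in one method, so they cannot drift apart; how the real {{JobInProgress}} would obtain its handle to the {{JobTracker}} is left open here.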
The proposed changes would alter the current pseudocode for raising events from
the following:
{noformat}
JobStatus oldStatus = job.getStatus().clone()
make changes to the job's status
JobStatus newStatus = job.getStatus().clone()
create event with both old and new status
inform listeners
{noformat}
to the following:
{noformat}
make changes to job
create JobChanged event
inform listeners
{noformat}
So each scheduler would have to maintain, on its own, an association with the
scheduling information it previously used to populate its internal structures,
instead of relying on the {{JobTracker}} to send correct information.
Currently, the default scheduler {{JobQueueTaskScheduler}} maintains the
ordered list of jobs in a {{TreeMap<JobSchedulingInfo,JobInProgress>}}. During
an update, the key into this map is constructed from the _oldStatus_ field of
the {{JobStatusChangeEvent}}. With the proposed change, _oldStatus_ is removed,
so the default scheduler would have to maintain its own association from job to
job scheduling info, i.e. a {{Map<JobID,JobSchedulingInfo>}} in which the value
for a JobID is the current {{JobSchedulingInfo}} that was used to insert the
job into the scheduler's {{TreeMap}}. When {{jobUpdated()}} is called, the old
{{JobSchedulingInfo}} is looked up in the {{Map}} and used to remove the entry
from the {{TreeMap}}; then both the {{Map<JobID,JobSchedulingInfo>}} and the
{{TreeMap<JobSchedulingInfo,JobInProgress>}} are updated with the most recent
{{JobSchedulingInfo}}.
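To make the bookkeeping concrete, here is a minimal sketch of the two structures kept in step. {{SchedInfo}} and {{SketchQueue}} are hypothetical stand-ins for {{JobSchedulingInfo}} and the scheduler's listener; job IDs are plain strings, the {{TreeMap}} value is just the job ID instead of a {{JobInProgress}}, and the ordering (priority, then start time, then ID) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for JobSchedulingInfo: an immutable snapshot of the
// fields that determine a job's position in the queue.
final class SchedInfo {
    final String jobId;
    final int priority;   // assumption: a lower value sorts first
    final long startTime;

    SchedInfo(String jobId, int priority, long startTime) {
        this.jobId = jobId;
        this.priority = priority;
        this.startTime = startTime;
    }
}

// Hypothetical stand-in for the scheduler-side listener.
class SketchQueue {
    // Assumed ordering: priority, then start time, then job ID as tie-breaker.
    static final Comparator<SchedInfo> ORDER =
        Comparator.comparingInt((SchedInfo s) -> s.priority)
                  .thenComparingLong((SchedInfo s) -> s.startTime)
                  .thenComparing((SchedInfo s) -> s.jobId);

    // Ordered queue; the value stands in for the JobInProgress.
    private final TreeMap<SchedInfo, String> queue = new TreeMap<>(ORDER);
    // Scheduler-owned link from job ID to the key currently in the queue.
    private final Map<String, SchedInfo> current = new HashMap<>();

    void jobAdded(String jobId, int priority, long startTime) {
        SchedInfo info = new SchedInfo(jobId, priority, startTime);
        current.put(jobId, info);
        queue.put(info, jobId);
    }

    // No old status is passed in: the old key comes from the scheduler's map.
    void jobUpdated(String jobId, int priority, long startTime) {
        SchedInfo old = current.get(jobId);
        if (old == null) {
            return; // job not tracked by this queue
        }
        queue.remove(old);
        SchedInfo fresh = new SchedInfo(jobId, priority, startTime);
        current.put(jobId, fresh);
        queue.put(fresh, jobId);
    }

    void jobRemoved(String jobId) {
        SchedInfo old = current.remove(jobId);
        if (old != null) {
            queue.remove(old);
        }
    }

    List<String> order() {
        return new ArrayList<>(queue.values());
    }
}

class SchedulerIndexDemo {
    public static void main(String[] args) {
        SketchQueue q = new SketchQueue();
        q.jobAdded("a", 2, 100L);
        q.jobAdded("b", 2, 200L);
        System.out.println(q.order()); // prints [a, b]
        q.jobUpdated("b", 1, 200L);    // raise b's priority
        System.out.println(q.order()); // prints [b, a]
    }
}
```

Because the key used for removal is the one the scheduler itself stored at insertion time, a stale or incomplete status coming from the tracker can no longer leave an orphaned entry in the {{TreeMap}}.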
Any comments on the above proposal and the changes it would bring to the
framework?
> Simplify the job updated event notification between Jobtracker and schedulers
> -----------------------------------------------------------------------------
>
> Key: MAPREDUCE-802
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-802
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: jobtracker
> Reporter: Hemanth Yamijala
> Assignee: Sreekanth Ramakrishnan
>
> HADOOP-4053 and HADOOP-4149 added events so that updates to the state or
> properties of a job, such as its run state or priority, are notified to the
> scheduler. We've seen some issues with this framework, such as the following:
> - Events are not raised correctly at all places. If a new code path is added
> to kill a job, raising events is missed out.
> - Events are raised with incorrect event data. For example, the start time
> value is typically missed out.
> The resulting contract break between jobtracker and schedulers has led to
> problems in the capacity scheduler where jobs remain stuck in the queue
> without ever being removed, and so on.
> It has proven complicated to get this right in the framework and fixes have
> typically still left dangling cases. Or new code paths introduce new bugs.
> This JIRA is about trying to simplify the interaction model so that it is
> more robust and works well.