[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509385#comment-13509385
 ] 

Xuan Gong commented on MAPREDUCE-4835:
--------------------------------------

The approach "Have JobImpl.finished ignore incrementing any metrics if the job is
already in a terminal state (SUCCEEDED/FAILED/KILLED) to avoid double-counting
a job" may not work: by the time finished is called, the current state has
already changed, so it is very difficult to tell whether the previous state
was terminal or not.
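To make that concern concrete, here is a hypothetical, self-contained Java model. It is not Hadoop code: the class, the five-value State enum, and the map-based counters are illustrative assumptions. It assumes, as argued above, that the state is updated before finished() runs. Under that ordering, a guard inside finished() that checks the *current* state always sees a terminal state, so it skips even a job's first, legitimate completion.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical model of guarding inside finished(), assuming the job's state
 * has already been updated before finished() is called. Not Hadoop's JobImpl.
 */
public class GuardInFinishedSketch {

  // Simplified, illustrative subset of job states.
  enum State { RUNNING, SUCCEEDED, FAILED, KILLED, ERROR }

  static final Set<State> TERMINAL =
      EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED, State.ERROR);

  static class Job {
    State state = State.RUNNING;
    // Per-outcome counters standing in for the AM's job metrics.
    final Map<State, Integer> metrics = new EnumMap<>(State.class);

    // Guard on the *current* state, intended to skip already-counted jobs.
    void finished(State finalState) {
      if (TERMINAL.contains(state)) {
        return;
      }
      // ERROR is counted as a failed job.
      State bucket = (finalState == State.ERROR) ? State.FAILED : finalState;
      metrics.merge(bucket, 1, Integer::sum);
    }

    // Assumed ordering: the state changes first, then finished() is called.
    void transitionTo(State finalState) {
      state = finalState;
      finished(finalState);
    }
  }

  public static void main(String[] args) {
    Job job = new Job();
    job.transitionTo(State.ERROR);   // a first, legitimate failure
    System.out.println(job.metrics); // {} : the failure is never counted
  }
}
```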
For example, suppose we somehow hit InternalErrorTransition, which changes the
state from SUCCEEDED to ERROR. The code in InternalErrorTransition is:
    public void transition(JobImpl job, JobEvent event) {
      //TODO Is this JH event required.
      job.setFinishTime();
      JobUnsuccessfulCompletionEvent failedEvent =
          new JobUnsuccessfulCompletionEvent(job.oldJobId,
              job.finishTime, 0, 0,
              JobStateInternal.ERROR.toString());
      // This line is what actually changes the state.
      job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent));
      // This line increments the failure count again, duplicating the count.
      job.finished(JobStateInternal.ERROR);
    }
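A minimal, hypothetical Java model of the double count (illustrative names only, not JobImpl): the job is first counted as succeeded, then an unguarded error handler calls finished() again, so the single job is also counted as failed.

```java
import java.util.EnumMap;
import java.util.Map;

/**
 * Hypothetical, self-contained model of the double-counting bug.
 * Not Hadoop's JobImpl: the class, enum, and counters are illustrative only.
 */
public class DoubleCountSketch {

  // Simplified subset of job states; the real JobStateInternal has more values.
  enum State { RUNNING, SUCCEEDED, FAILED, KILLED, ERROR }

  static class Job {
    State state = State.RUNNING;
    // Per-outcome counters standing in for the AM's job metrics.
    final Map<State, Integer> metrics = new EnumMap<>(State.class);

    // Plays the role of finished(): record the outcome unconditionally.
    void finished(State finalState) {
      // ERROR is counted as a failed job, as in the issue description.
      State bucket = (finalState == State.ERROR) ? State.FAILED : finalState;
      metrics.merge(bucket, 1, Integer::sum);
      state = finalState;
    }

    // Plays the role of the unguarded InternalErrorTransition.
    void internalError() {
      finished(State.ERROR);
    }
  }

  public static void main(String[] args) {
    Job job = new Job();
    job.finished(State.SUCCEEDED);   // job completes normally
    job.internalError();             // invalid transition afterwards
    // One job, but it now shows up as both succeeded and failed.
    System.out.println(job.metrics); // {SUCCEEDED=1, FAILED=1}
  }
}
```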
So what we can do is add JobStateInternal previousState = job.getInternalState()
before job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent)),
and check previousState to decide whether we need to increment the count.
For example, if we do not want to increment the count when a job moves from a
terminal state to the ERROR state, we can do the following in
InternalErrorTransition:
    public void transition(JobImpl job, JobEvent event) {
      //TODO Is this JH event required.
      job.setFinishTime();
      JobUnsuccessfulCompletionEvent failedEvent =
          new JobUnsuccessfulCompletionEvent(job.oldJobId,
              job.finishTime, 0, 0,
              JobStateInternal.ERROR.toString());
      JobStateInternal previousState = job.getInternalState();
      job.eventHandler.handle(new JobHistoryEvent(job.jobId, failedEvent));
      // Count the job only if the previous state was not terminal. If it was
      // already ERROR, the count was incremented at that point, so do not
      // increment it again.
      if (previousState != JobStateInternal.SUCCEEDED
          && previousState != JobStateInternal.KILLED
          && previousState != JobStateInternal.FAILED
          && previousState != JobStateInternal.ERROR) {
        job.finished(JobStateInternal.ERROR);
      }
    }
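A hypothetical Java sketch of the proposed guard, using an illustrative State enum and counter map rather than Hadoop's classes: the state is read before anything changes it, and finished() is called only if the job had not already reached a terminal state.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the proposed guard: capture the state before handling
 * the error, and skip the metrics update if the job was already terminal.
 * Names are illustrative, not Hadoop's actual JobImpl API.
 */
public class GuardedErrorSketch {

  enum State { RUNNING, SUCCEEDED, FAILED, KILLED, ERROR }

  // States whose outcome has already been counted.
  static final Set<State> TERMINAL =
      EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED, State.ERROR);

  static class Job {
    State state = State.RUNNING;
    final Map<State, Integer> metrics = new EnumMap<>(State.class);

    void finished(State finalState) {
      // ERROR is counted as a failed job.
      State bucket = (finalState == State.ERROR) ? State.FAILED : finalState;
      metrics.merge(bucket, 1, Integer::sum);
      state = finalState;
    }

    void internalError() {
      State previousState = state;   // read before the state changes
      if (!TERMINAL.contains(previousState)) {
        finished(State.ERROR);       // first time the job ends: count it
      } else {
        state = State.ERROR;         // already counted: only record the error
      }
    }
  }

  public static void main(String[] args) {
    Job done = new Job();
    done.finished(State.SUCCEEDED);
    done.internalError();
    System.out.println(done.metrics);    // {SUCCEEDED=1}

    Job running = new Job();
    running.internalError();
    System.out.println(running.metrics); // {FAILED=1}
  }
}
```

Keeping the terminal states in an EnumSet turns the check into a single membership test, which avoids the easy mistake of chaining != comparisons with || (a condition that is always true) instead of &&.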
                
> AM job metrics can double-count a job if it errors after entering a completion state
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4835
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4835
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.3-alpha, 0.23.6
>            Reporter: Jason Lowe
>            Priority: Minor
>
> If JobImpl enters the SUCCEEDED, FAILED, or KILLED state but then encounters 
> an invalid state transition, it could double-count the job since jobs that 
> encounter an error are considered failed jobs.  Therefore the job could be 
> counted initially as a successful, failed, or killed job, respectively, then 
> counted again as a failed job due to the internal error afterwards.
