[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harsh J reopened MAPREDUCE-3969:
--------------------------------

      Assignee: Harsh J

Arun,

Did you get a chance to check the issue before closing this? This is very well 
present in branch-1 too. Reported version was set to 0.20.2 but it was CDH3 per 
his environment field, which shares much of the same security code, making it 
worthwhile to at least take a look at.

Its very rare to hit this, but it is present and it is pretty critical once it 
hits (and requires a specific TT to restart to resume the tasks, else the job 
just hangs).

The fix would be to synchronize the writers list object before entering the 
loop that can potentially modify the list, inside of JobHistory.log(…):

{code}
+ synchronized (writers) {
      for (Iterator<PrintWriter> iter = writers.iterator(); iter.hasNext();) {
        PrintWriter out = iter.next();
        out.println(builder.toString());
        if (out.checkError() && id != null) {
          LOG.info("Logging failed for job " + id + " removing PrintWriter from 
FileManager");
          iter.remove();
        }
      }
+    }
{code}

Its the {{iter.remove();}} conditional call that causes the issue, as it is 
done without a lock (while others may access the same set of writers in 
parallel - the fact which makes this doubly very rare to hit).

Do you agree with the above analysis Arun? If so, I'll post up a patch. Else 
lemme know what I've missed here.
                
> ConcurrentModificationException in JobHistory.java
> --------------------------------------------------
>
>                 Key: MAPREDUCE-3969
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3969
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>         Environment: cdh3u1 hadoop distribution, centos 5.5.
>            Reporter: Alexey Zotov
>            Assignee: Harsh J
>
> {code}
> 2012-03-01 04:24:47,479 INFO org.apache.hadoop.mapred.JobInProgress: Task 
> 'attempt_201202150320_3709_m_000148_0' has completed 
> task_201202150320_3709_m_000148 successfully.
> 2012-03-01 04:24:47,479 INFO org.apache.hadoop.mapred.JobHistory: Logging 
> failed for job job_201202150320_3709removing PrintWriter from FileManager
> 2012-03-01 04:24:47,479 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8021, call 
> heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@61069281, false, false, 
> true, -21317) from <TASKTRACKER-IP>:450
> 41: error: java.io.IOException: java.util.ConcurrentModificationException
> java.io.IOException: java.util.ConcurrentModificationException
>         at 
> java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
>         at java.util.AbstractList$Itr.next(AbstractList.java:343)
>         at org.apache.hadoop.mapred.JobHistory.log(JobHistory.java:591)
>         at 
> org.apache.hadoop.mapred.JobHistory$MapAttempt.logFinished(JobHistory.java:1735)
>         at 
> org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2515)
>         at 
> org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1200)
>         at 
> org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4539)
>         at 
> org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3503)
>         at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3202)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
> {code}
> Task task_201202150320_3709_m_000148 was being marked as failed (but had not 
> been restarted) at that moment and job execution was being freezed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to