[ 
https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tahir Hashmi updated HADOOP-1152:
---------------------------------

    Attachment: 1152.patch

Devaraj and I looked at this yesterday. Our theory about why this fails is 
that ReduceTaskRunner.MapOutputCopier.copyOutput() renames a temporary file to 
the actual .out file and then makes a second call to get the length of the 
renamed file (which is, of course, the same as that of the temporary file). 
If a MergeThread flushes the .out file to disk and deletes it between these 
two calls, the call to getLength() fails.

Sameer's suggestion was to simply invoke getLength() on the temporary file 
before the rename and discard the value if the rename fails. Once the file has 
been renamed, the local code should assume that it no longer owns it. 
1152.patch incorporates these changes.
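
The ordering described above can be sketched in isolation. This is only an 
illustration, not the code in 1152.patch: it uses java.nio.file on the local 
disk in place of Hadoop's InMemoryFileSystem, and the class name 
CopyOutputSketch, the method publish() and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CopyOutputSketch {

    /**
     * Hypothetical helper: reads the length while the caller still owns the
     * temporary file, then renames it to its final name.
     */
    public static long publish(Path tmp, Path dest) throws IOException {
        // Safe: no other thread knows about the temporary file yet.
        long size = Files.size(tmp);

        // Throws FileAlreadyExistsException if dest exists; in that case the
        // size read above is simply discarded along with the stack frame.
        Files.move(tmp, dest);

        // Do not query dest from here on: a merge thread may already have
        // consumed and deleted it.
        return size;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("copy-sketch");
        Path tmp = dir.resolve("map_0.out.tmp");
        Path dest = dir.resolve("map_0.out");
        Files.write(tmp, new byte[] {1, 2, 3, 4});

        long size = publish(tmp, dest);

        // Simulate the merge thread removing the file right after the rename.
        Files.delete(dest);

        // The length is still available because it was read before the rename.
        System.out.println("copied " + size + " bytes");
    }
}
```

Because the length is captured before the rename, nothing after the rename 
needs to touch the destination file, so a concurrent delete cannot cause the 
length lookup to fail.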

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch, 1152.workaround.patch
>
>
> We had a couple of reduce tasks hang, repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: 
> java.io.IOException: Path 
> /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at 
> org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at 
> org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at 
> org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: 
> task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 
> seconds
> ===============================
> Before the above output, there was 
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output 
> copy failure: java.lang.NullPointerException
>   at 
> org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at 
> org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at 
> org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at 
> org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
