ChecksumFileSystem checksum file size incorrect.
------------------------------------------------

                 Key: HADOOP-2080
                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
             Project: Hadoop
          Issue Type: Bug
          Components: fs
    Affects Versions: 0.14.2, 0.14.1, 0.14.0
         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
            Reporter: Richard Lee


Periodically, reduce tasks hang. The log for the hung task shows a stack 
trace like this:

2007-10-18 17:02:04,227 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Insufficient space
        at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryOutputStream.write(InMemoryFileSystem.java:174)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:39)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:326)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:122)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:310)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
        at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:253)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:685)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:637)

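The problem stems from a miscalculation of the size of the checksum file 
that is created in the InMemoryFileSystem for the data being copied from a 
completed mapper task to the reducer task. The space reserved for the 
checksum file is too small, so the final write fails with "Insufficient space".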

The method used for calculating checksum file size is the following 
(ChecksumFileSystem:318):

((long)(Math.ceil((float)size/bytesPerSum)) + 1) * 4 + CHECKSUM_VERSION.length;

The issue here is the cast to float.  A Java float has only 24 bits of 
precision, so a size over 0x1000000 (2^24) may be rounded when it is cast, 
and the computed checksum file size can come out short.  The fix is to 
replace this calculation with one that uses only integer arithmetic:

(((size+1)/bytesPerSum) + 2) * 4 + CHECKSUM_VERSION.length
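
The rounding can be reproduced outside Hadoop. The following is a minimal 
sketch, not the actual Hadoop code: the class name, the method names, the 
HEADER_LENGTH constant (a stand-in for CHECKSUM_VERSION.length, assumed to 
be 4) and the bytesPerSum value of 512 are all assumptions made for 
illustration. The "reference" method is an integer ceiling division added 
here only as a baseline for comparison.

```java
class ChecksumSizeDemo {
    // Assumed stand-in for CHECKSUM_VERSION.length; not taken from the issue.
    static final int HEADER_LENGTH = 4;

    // The calculation quoted in the issue, with the cast to float.
    static long floatBased(long size, int bytesPerSum) {
        return ((long) (Math.ceil((float) size / bytesPerSum)) + 1) * 4 + HEADER_LENGTH;
    }

    // The replacement proposed in the issue, using integer arithmetic only.
    static long integerBased(long size, int bytesPerSum) {
        return (((size + 1) / bytesPerSum) + 2) * 4 + HEADER_LENGTH;
    }

    // Baseline for comparison: the same "ceil + 1" intent, computed with an
    // exact integer ceiling division (an assumption, not part of the issue).
    static long reference(long size, int bytesPerSum) {
        long chunks = (size + bytesPerSum - 1) / bytesPerSum;
        return (chunks + 1) * 4 + HEADER_LENGTH;
    }

    public static void main(String[] args) {
        long size = 0x1000000L + 1;   // one byte past 2^24
        int bytesPerSum = 512;        // assumed chunk size
        System.out.println("float-based:   " + floatBased(size, bytesPerSum));
        System.out.println("integer-based: " + integerBased(size, bytesPerSum));
        System.out.println("reference:     " + reference(size, bytesPerSum));
    }
}
```

With these assumed values, the cast rounds 16777217 down to 16777216, the 
final partial chunk is not counted, and the float-based result is 4 bytes 
smaller than the other two. Note that the proposed integer formula can 
reserve one checksum entry more than the baseline for some sizes, which 
is harmless when the goal is to reserve enough space.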



