[ http://issues.apache.org/jira/browse/HADOOP-573?page=comments#action_12439567 ]

Doug Cutting commented on HADOOP-573:
-------------------------------------

Sort is actually the place where most checksum errors have been reported.  I 
believe this is because sorting keeps data in memory longer than other 
operations do, increasing the chance that it will be corrupted there.  Does 
this node have ECC memory?  If so, memory errors are unlikely.  Sorting also 
accounts for a large share of the data written to disk, so the corruption 
could have happened there instead.  It would be worth examining the syslog on 
that node to see whether any disk or memory errors are reported.
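
For context, the check that is failing here verifies a CRC32 over each 
fixed-size chunk of data as it is read back, so it catches corruption whether 
the bytes went bad in RAM or on disk.  Here is a minimal sketch of that kind 
of chunked-checksum verification, in the spirit of FSDataInputStream$Checker; 
the class and method names are illustrative, not the actual code, and the 
512-byte chunk size is an assumption based on the io.bytes.per.checksum 
default:

    import java.util.zip.CRC32;
    import java.io.IOException;

    public class ChunkChecksum {
      // Assumed chunk size (io.bytes.per.checksum default).
      static final int BYTES_PER_CHECKSUM = 512;

      // CRC32 computed over a chunk when it is written out.
      static long checksumChunk(byte[] buf, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(buf, off, len);
        return crc.getValue();
      }

      // On read, recompute the CRC and compare it with the stored value.
      // A mismatch only tells you the bytes changed between write and read;
      // it cannot say whether the corruption happened in memory or on disk.
      static void verifyChunk(byte[] buf, int off, int len,
                              long storedCrc, long filePos) throws IOException {
        if (checksumChunk(buf, off, len) != storedCrc) {
          throw new IOException("Checksum error at " + filePos);
        }
      }
    }

Since the checksum can only detect corruption, not localize it, the syslog is 
the next place to look.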

I assume the reduce was rescheduled and completed?  If so, I will resolve 
this issue.


> Checksum error during sorting in reducer
> ----------------------------------------
>
>                 Key: HADOOP-573
>                 URL: http://issues.apache.org/jira/browse/HADOOP-573
>             Project: Hadoop
>          Issue Type: Bug
>            Reporter: Runping Qi
>
> Many reduce tasks were killed due to checksum errors. The strange thing is 
> that the file was generated by the sort function and was on a local disk. 
> Here is the stack: 
> Checksum error:  ../task_0011_r_000140_0/all.2.1 at 5342920704
>       at org.apache.hadoop.fs.FSDataInputStream$Checker.verifySum(FSDataInputStream.java:134)
>       at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:110)
>       at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:170)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
>       at java.io.DataInputStream.readFully(DataInputStream.java:176)
>       at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
>       at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
>       at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1061)
>       at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1126)
>       at org.apache.hadoop.io.SequenceFile$Reader.nextRaw(SequenceFile.java:1354)
>       at org.apache.hadoop.io.SequenceFile$Sorter$MergeStream.next(SequenceFile.java:1880)
>       at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:1938)
>       at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:1802)
>       at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:1749)
>       at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1494)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:240)
>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1066)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira