Doug,
Do we know if this is a hardware issue? If it is possibly a software
issue, I can dedicate some resources to tracking down bugs. I would just
need a little guidance on where to start looking.
Dennis Kubes
Doug Cutting wrote:
Do you have ECC memory on your nodes? Nodes without ECC have been known
to trigger high rates of checksum errors.
Doug
Dennis Kubes wrote:
All,
We are continually experiencing checksum errors when running some jobs
under heavy load (specifically merging segments or crawldbs). I am
lost as to whether this is a hardware or software problem. Two
questions: first, is anyone else experiencing a large number of checksum
errors on big clusters? Second, does anyone know whether this is
hardware or software related? Here are some examples.
Dennis Kubes
org.apache.hadoop.fs.ChecksumException: Checksum error: /d01/hadoop/mapred/local/task_0042_m_001905_0/spill0.out at 79597056
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.readFully(DataInputStream.java:176)
at java.io.DataInputStream.readFully(DataInputStream.java:152)
at org.apache.hadoop.io.SequenceFile$UncompressedBytes.reset(SequenceFile.java:427)
at org.apache.hadoop.io.SequenceFile$UncompressedBytes.access$700(SequenceFile.java:414)
at org.apache.hadoop.io.SequenceFile$Reader.nextRawValue(SequenceFile.java:1669)
at org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawValue(SequenceFile.java:2585)
at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.next(SequenceFile.java:2356)
at org.apache.hadoop.io.SequenceFile$Sorter.writeFile(SequenceFile.java:2230)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:517)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1701)
Map output lost, rescheduling: getMapOutput(task_0042_m_000375_0,4) failed :
org.apache.hadoop.fs.ChecksumException: Checksum error: /d01/hadoop/mapred/local/task_0042_m_000375_0/file.out at 20267008
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.read(DataInputStream.java:134)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1932)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
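
For readers unfamiliar with what "Checksum error: ... at <offset>" is reporting, here is a minimal sketch of per-chunk CRC verification. It is not the Hadoop implementation: the class name ChunkedChecksumSketch, both method names, and the 512-byte chunk size are assumptions made for illustration (both offsets in the traces above happen to be multiples of 512, which is at least consistent with a fixed chunk size of that order). The idea is that a CRC is stored per fixed-size chunk when the file is written, recomputed on read, and the first mismatching chunk's byte offset is reported.

```java
import java.util.zip.CRC32;

// Hypothetical illustration only; names and chunk size are assumptions,
// not taken from the Hadoop source.
public class ChunkedChecksumSketch {

    // Compute one CRC32 value per fixed-size chunk of the data.
    static long[] checksums(byte[] data, int chunkSize) {
        int n = (data.length + chunkSize - 1) / chunkSize;
        long[] sums = new long[n];
        CRC32 crc = new CRC32();
        for (int i = 0; i < n; i++) {
            int off = i * chunkSize;
            int len = Math.min(chunkSize, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Return the byte offset of the first chunk whose recomputed CRC
    // differs from the stored one, or -1 if every chunk matches.
    static long firstBadOffset(byte[] data, long[] expected, int chunkSize) {
        long[] actual = checksums(data, chunkSize);
        for (int i = 0; i < actual.length; i++) {
            if (i >= expected.length || actual[i] != expected[i]) {
                return (long) i * chunkSize;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[2000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        long[] stored = checksums(data, 512);  // checksums taken at "write" time
        data[1030] ^= 0x01;                    // simulate a single flipped bit
        System.out.println(firstBadOffset(data, stored, 512)); // prints 1024
    }
}
```

A check like this only says that the bytes read differ from the bytes written. It cannot say whether the corruption happened in memory, on disk, or in software, which is why the ECC question above matters.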