Re: Many Checksum Errors

2007-05-02 Thread Dennis Kubes

Doug,

Do we know whether this is a hardware issue?  If it might be a software 
issue, I can dedicate some resources to tracking down bugs.  I would just 
need a little guidance on where to start looking.


Dennis Kubes

Doug Cutting wrote:
Do you have ECC memory on your nodes?  Nodes without ECC have been known 
to trigger high rates of checksum errors.


Doug

Dennis Kubes wrote:

All,

We are continually experiencing checksum errors when running some jobs 
under heavy load (specifically, merging segments or crawldbs).  I can't 
tell whether this is a hardware or a software problem.  Two questions: 
first, is anyone else experiencing a large number of checksum errors on 
big clusters?  Second, does anyone know whether this is hardware or 
software related?  Here are some examples.


Dennis Kubes


org.apache.hadoop.fs.ChecksumException: Checksum error: /d01/hadoop/mapred/local/task_0042_m_001905_0/spill0.out at 79597056
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
    at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
    at java.io.DataInputStream.readFully(DataInputStream.java:176)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$UncompressedBytes.reset(SequenceFile.java:427)
    at org.apache.hadoop.io.SequenceFile$UncompressedBytes.access$700(SequenceFile.java:414)
    at org.apache.hadoop.io.SequenceFile$Reader.nextRawValue(SequenceFile.java:1669)
    at org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawValue(SequenceFile.java:2585)
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.next(SequenceFile.java:2356)
    at org.apache.hadoop.io.SequenceFile$Sorter.writeFile(SequenceFile.java:2230)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:517)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:191)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1701)




Map output lost, rescheduling: getMapOutput(task_0042_m_000375_0,4) failed :
org.apache.hadoop.fs.ChecksumException: Checksum error: /d01/hadoop/mapred/local/task_0042_m_000375_0/file.out at 20267008
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:258)
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211)
    at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
    at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:254)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
    at java.io.DataInputStream.read(DataInputStream.java:134)
    at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1932)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
    at org.mortbay.http.HttpServer.service(HttpServer.java:954)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
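
For context on the "at <offset>" values in these traces: a checksummed file system typically stores one CRC per fixed-size chunk of data and re-verifies each chunk on read. The sketch below illustrates that idea with java.util.zip.CRC32. It is not Hadoop's ChecksumFileSystem code; the class name ChunkChecksums and the 512-byte chunk size are assumptions made for illustration (both offsets reported above are multiples of 512, but the thread does not state the configured chunk size).

```java
import java.util.zip.CRC32;

public class ChunkChecksums {
    // Assumed chunk size for illustration only; not taken from the thread.
    static final int BYTES_PER_CHECKSUM = 512;

    /** Computes one CRC32 value per BYTES_PER_CHECKSUM-sized chunk of data. */
    public static long[] compute(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /**
     * Returns the byte offset of the first chunk whose CRC differs from the
     * stored value, or -1 if every chunk verifies.
     */
    public static long firstMismatch(byte[] data, long[] expected) {
        long[] actual = compute(data);
        if (actual.length != expected.length) {
            throw new IllegalArgumentException("chunk count differs from stored checksums");
        }
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expected[i]) {
                return (long) i * BYTES_PER_CHECKSUM;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[2048];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        long[] stored = compute(data);   // checksums written alongside the data
        data[1300] ^= 0x01;              // simulate a single flipped bit
        System.out.println("first bad chunk at offset " + firstMismatch(data, stored));
    }
}
```

Note that a check like this reports only where the data and its stored checksum disagree; it cannot tell whether the corruption happened in memory, on disk, or in software.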



Re: Many Checksum Errors

2007-05-02 Thread Doug Cutting

Dennis Kubes wrote:
Do we know whether this is a hardware issue?  If it might be a software 
issue, I can dedicate some resources to tracking down bugs.  I would just 
need a little guidance on where to start looking.


We don't know.  The checksum mechanism is designed to catch hardware 
problems.  So one must certainly consider that as a likely cause.  If it 
is instead a software bug then it should be reproducible.  Are you 
seeing any consistent patterns?  If not, then I'd lean towards hardware.
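
One way to look for such patterns is to tally the checksum errors by node and by disk. The sketch below is only an illustration: it assumes the log lines carry the "Checksum error: <path> at <offset>" message seen in the traces earlier in this thread, and the class name ChecksumErrorTally and the host names are hypothetical. Errors concentrated on one node or one disk would point towards hardware; errors spread evenly across nodes are less conclusive.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChecksumErrorTally {
    // Matches the message format seen in this thread's traces:
    // "Checksum error: <path> at <offset>"
    private static final Pattern ERROR =
        Pattern.compile("Checksum error:\\s*(\\S+)\\s+at\\s+(\\d+)");

    private final Map<String, Integer> counts = new TreeMap<>();

    /** Records one log line collected from the given host; non-matching lines are ignored. */
    public void record(String host, String logLine) {
        Matcher m = ERROR.matcher(logLine);
        if (!m.find()) {
            return;
        }
        String path = m.group(1);
        // Use the top-level directory (e.g. "/d01") as a rough stand-in for the disk.
        int slash = path.indexOf('/', 1);
        String disk = slash > 0 ? path.substring(0, slash) : path;
        counts.merge(host + ":" + disk, 1, Integer::sum);
    }

    /** Error counts keyed by "host:disk", sorted by key. */
    public Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        ChecksumErrorTally tally = new ChecksumErrorTally();
        // Host names are hypothetical; the messages are the two from this thread.
        List<String[]> samples = List.of(
            new String[] {"node-a", "org.apache.hadoop.fs.ChecksumException: Checksum error: "
                + "/d01/hadoop/mapred/local/task_0042_m_001905_0/spill0.out at 79597056"},
            new String[] {"node-b", "org.apache.hadoop.fs.ChecksumException: Checksum error: "
                + "/d01/hadoop/mapred/local/task_0042_m_000375_0/file.out at 20267008"});
        for (String[] s : samples) {
            tally.record(s[0], s[1]);
        }
        tally.counts().forEach((key, n) -> System.out.println(key + " " + n));
    }
}
```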


Michael Stack has some experience tracking down problems with flaky 
memory.  Michael, did you use a test program to validate the memory on a 
node?
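
As a rough illustration of what such a test program does (this is not the program Michael used, and the class name MemoryPatternCheck is hypothetical), the sketch below writes known bit patterns into a large buffer and reads them back. It is only a sketch under heavy assumptions: a program running inside a JVM cannot choose which physical memory it touches, and reads may be served from CPU caches, so a clean run proves little. A dedicated bootable tester such as memtest86 is far more thorough.

```java
public class MemoryPatternCheck {
    /**
     * Fills buf with values derived from pattern, reads them back, and
     * returns the number of words that do not match what was written.
     */
    public static long checkPattern(long[] buf, long pattern) {
        for (int i = 0; i < buf.length; i++) {
            buf[i] = pattern ^ i;   // vary the value per word
        }
        long mismatches = 0;
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] != (pattern ^ i)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        long[] buf = new long[8 * 1024 * 1024];   // 64 MB of longs
        long[] patterns = {0L, -1L, 0x5555555555555555L, 0xAAAAAAAAAAAAAAAAL};
        for (long p : patterns) {
            System.out.printf("pattern %016x: %d mismatches%n", p, checkPattern(buf, p));
        }
    }
}
```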


Again, do your nodes have ECC memory?

Doug


Re: Many Checksum Errors

2007-05-02 Thread Dennis Kubes



Doug Cutting wrote:

Dennis Kubes wrote:
Do we know whether this is a hardware issue?  If it might be a software 
issue, I can dedicate some resources to tracking down bugs.  I would 
just need a little guidance on where to start looking.


We don't know.  The checksum mechanism is designed to catch hardware 
problems.  So one must certainly consider that as a likely cause.  If it 
is instead a software bug then it should be reproducible.  Are you 
seeing any consistent patterns?  If not, then I'd lean towards hardware.


Michael Stack has some experience tracking down problems with flaky 
memory.  Michael, did you use a test program to validate the memory on a 
node?


Again, do your nodes have ECC memory?


Sorry, I was checking on that.  No, the nodes don't have ECC memory.  I 
just priced it out, and it is only $20 more per GB to go with ECC, so I 
think that is what we are going to do.  We are going to run some tests, 
and I will keep the list updated on our progress.  Thanks for your help.


Dennis Kubes


Doug