Hi,
I am trying to run a few benchmarks on a small Hadoop cluster of 4 VMs (2 VMs on each of 2 physical hosts; each VM has 1 CPU core, 2 GB RAM, its own disk, and Gbps bridged connectivity). I am using VirtualBox as the VMM. The workload reads a good number of random small files (64 MB each) concurrently from all the HDFS datanodes, through clients running on the same set of VMs. I am using FsShell cat to read the files, and I see checksum errors like these:

12/05/22 10:10:12 INFO fs.FSInputChecker: Found checksum error: b[3072, 3584]=cb93678dc0259c978731af408f2cb493b510c948b45039a4853688fd21c2a070fc030000ff7b807f000033d20100080027cf09e308002761d4480800450005dc2af04000400633ca816169cf816169d0c35a87c1b090973e78aa5ef880100e24446b00000101080a020fcf7b020fcea7d85a506ff1eaea5383eea539137745249aebc25e86d0feac89c4e0c9b91bc09ee146af7e9bd103c8269486a8c748091cfc42e178f461d9127f6c9676f47fa6863bb19f2e51142725ae643ffdfbe7027798e1f11314d9aa877db99a86db25f2f6d18d5b86062de737147b918e829fb178cfbbb57e932ab082197b1f4fa4315eae67210018c3c034b3f52481c4cebc53d1e2fd5ad4b67d87823f5e0923fa1ff579de88768f79a6df5f86a8a7eb3a68b3366063408b7292eef8f909580e3866676838ba8417bb810d9a9e8d12c49de4522214e1c6a22b64394a1e60e020b12d5803d2b6a53fe64d00b85dc63c67a8a94758f71a7a06a786e168ea234030806026ffed07770ba6d407437a4a83b96c2b3a3c767d834a19c438a0d6f56ca6fc9099d375ae1f95839c62f36a466818eb816d4d3ef6f3951ce3a19a3364a827bac8fd70833587c89084b847e4ceeae48df9256ef629c6325f67872478838777885f930710b71c02256b0cc66242d4974fbfb0ebcf85ef6cf4b67656dc6918bc57083dc8868e34662c98e183163a9fc82a42fddc
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
    at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1457)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2172)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
    at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114)
    at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
    at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:349)
    at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1913)
    at org.apache.hadoop.fs.FsShell.cat(FsShell.java:346)
    at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1557)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:1776)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
12/05/22 10:10:13 WARN hdfs.DFSClient: Found Checksum error for blk_2250776182612718654_6078 from XX.XX.XX.207:50010 at 52284416
12/05/22 10:10:13 INFO hdfs.DFSClient: Could not obtain block blk_2250776182612718654_6078 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
cat: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
cat: Checksum error: /blk_-5591790629390980895:of:/user/hduser/15-1/part-00192 at 30324736

Hadoop fsck does not report any corrupt blocks after writing the data, but after every iteration of reading the data I see new corrupt blocks (with output like the above). Interestingly, the higher the load (concurrent sequential reads) I put on the DFS cluster, the higher the chances of blocks getting corrupted.
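For context on why the error names a byte range like b[3072, 3584]: HDFS stores one CRC32 checksum per io.bytes.per.checksum bytes (512 by default), and FSInputChecker verifies each chunk as it reads, so even a single flipped bit is reported at the boundaries of its 512-byte chunk. A minimal illustration of that per-chunk scheme (my own sketch, not the actual FSInputChecker code):

```python
import zlib

CHUNK = 512  # HDFS default io.bytes.per.checksum

def make_checksums(data):
    """Compute one CRC32 per 512-byte chunk, as the writer does."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, checksums):
    """Return the byte range of the first corrupt chunk, or None if clean."""
    for n, crc in enumerate(make_checksums(data)):
        if crc != checksums[n]:
            return (n * CHUNK, n * CHUNK + CHUNK)
    return None

data = bytearray(4096)
sums = make_checksums(bytes(data))
data[3100] ^= 0x01                  # flip one bit inside the 7th chunk
print(verify(bytes(data), sums))    # -> (3072, 3584), same shape as the log above
```

So the "at 52284416" offsets in my logs just mark the first 512-byte chunk whose stored CRC no longer matches the data that came back.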
I (mostly) do not see any corruption when there is little or no read contention at the DFS servers. A few other people on the web have faced the same problem:

http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/508
http://tinyurl.com/7rsckwo

It has been suggested on these threads that faulty hardware may be causing this issue, and that checksum errors like these are a likely sign of it. So I diagnosed my RAM (non-ECC) and HDDs but did not find any problem there. I don't have ECC RAM to try with. What makes me more doubtful about hardware being the culprit is that the same workloads run fine on the same set of physical machines (and more) and do not cause any corruption of blocks. I also tried creating fresh VMs multiple times, but that did not help. Does anybody have any suggestions? I am not sure whether the reason is weak VMs, since I see corruption happening only with VMs, and the corruption increases as I increase the load on the DFS cluster.

Thanks,
Akshay
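One thing I could try, to separate HDFS from the VM's storage/memory path: a plain concurrent-read stress test inside the guest, outside Hadoop entirely, that re-checksums files on every pass. If this also reports mismatches under load, the hypervisor/disk path is suspect rather than HDFS. A rough sketch (file count, sizes, and pass count are arbitrary choices of mine):

```python
import hashlib, os, tempfile, threading

def stress_read(paths, expected, passes=3):
    """Re-read each file concurrently and compare against its known digest."""
    errors = []
    def worker(path):
        for _ in range(passes):
            with open(path, "rb") as f:
                if hashlib.md5(f.read()).hexdigest() != expected[path]:
                    errors.append(path)  # list.append is thread-safe in CPython
    threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

# Write a few files with known digests, then hammer them with parallel reads.
tmp = tempfile.mkdtemp()
paths, expected = [], {}
for i in range(4):
    p = os.path.join(tmp, "f%d" % i)
    data = os.urandom(1 << 20)          # 1 MB of random bytes per file
    with open(p, "wb") as f:
        f.write(data)
    paths.append(p)
    expected[p] = hashlib.md5(data).hexdigest()

print(stress_read(paths, expected))     # an empty list means no corruption seen
```

Scaled up (more threads, files larger than guest RAM so reads actually hit the virtual disk), repeated mismatches here would point at the VM I/O stack rather than anything Hadoop-specific.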