Hi,


I am trying to run a few benchmarks on a small Hadoop cluster of 4 VMs (2 on 2 
physical hosts, each VM having 1 CPU core, 2 GB RAM, an individual disk, and 
Gbps bridged connectivity). I am using VirtualBox as the VMM.


This workload concurrently reads a good number of random small files (64 MB 
each) from all the HDFS datanodes, through clients running on the same set of 
VMs. I am using FsShell cat to read the files (see the sketch after the log 
below for what each read amounts to), and I see these checksum errors:

12/05/22 10:10:12 INFO fs.FSInputChecker: Found checksum error: b[3072, 
3584]=cb93678dc0259c978731af408f2cb493b510c948b45039a4853688fd21c2a070fc030000ff7b807f000033d20100080027
cf09e308002761d4480800450005dc2af04000400633ca816169cf816169d0c35a87c1b090973e78aa5ef880100e24446b00000101080a020fcf7b020fcea7d85a506ff1eaea5383eea539137745249aebc25e86d0feac89
c4e0c9b91bc09ee146af7e9bd103c8269486a8c748091cfc42e178f461d9127f6c9676f47fa6863bb19f2e51142725ae643ffdfbe7027798e1f11314d9aa877db99a86db25f2f6d18d5b86062de737147b918e829fb178cf
bbb57e932ab082197b1f4fa4315eae67210018c3c034b3f52481c4cebc53d1e2fd5ad4b67d87823f5e0923fa1ff579de88768f79a6df5f86a8a7eb3a68b3366063408b7292eef8f909580e3866676838ba8417bb810d9a9e
8d12c49de4522214e1c6a22b64394a1e60e020b12d5803d2b6a53fe64d00b85dc63c67a8a94758f71a7a06a786e168ea234030806026ffed07770ba6d407437a4a83b96c2b3a3c767d834a19c438a0d6f56ca6fc9099d375
ae1f95839c62f36a466818eb816d4d3ef6f3951ce3a19a3364a827bac8fd70833587c89084b847e4ceeae48df9256ef629c6325f67872478838777885f930710b71c02256b0cc66242d4974fbfb0ebcf85ef6cf4b67656dc
6918bc57083dc8868e34662c98e183163a9fc82a42fddc
org.apache.hadoop.fs.ChecksumException: Checksum error: 
/blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
        at 
org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at 
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at 
org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1457)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2172)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
        at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114)
        at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
        at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:349)
        at 
org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1913)
        at org.apache.hadoop.fs.FsShell.cat(FsShell.java:346)
        at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1557)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1776)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
12/05/22 10:10:13 WARN hdfs.DFSClient: Found Checksum error for 
blk_2250776182612718654_6078 from XX.XX.XX.207:50010 at 52284416
12/05/22 10:10:13 INFO hdfs.DFSClient: Could not obtain block 
blk_2250776182612718654_6078 from any node: java.io.IOException: No live nodes 
contain current block. Will get new block locations from namenode and retry...
cat: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 
at 52284416
cat: Checksum error: /blk_-5591790629390980895:of:/user/hduser/15-1/part-00192 
at 30324736
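
In case it helps to see exactly what each reader is doing: the reads are plain 
sequential streams through the DFS client, essentially the same path FsShell 
cat takes (fs.open() -> IOUtils.copyBytes() -> FSInputChecker, as in the trace 
above). A minimal sketch, assuming the usual FileSystem API (the CatFile class 
name and the args handling are only for illustration):

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CatFile {
        public static void main(String[] args) throws Exception {
            // Read one HDFS file end to end, like "hadoop fs -cat <path>"
            FileSystem fs = FileSystem.get(new Configuration());
            InputStream in = null;
            try {
                // e.g. args[0] = /user/hduser/15-3/part-00197
                in = fs.open(new Path(args[0]));
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

Several such reads run concurrently on each VM, each against a different file.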

Hadoop fsck does not report any corrupt blocks after writing the data, but 
after every iteration of reading the data I see new corrupt blocks (with output 
as above). Interestingly, the higher the load (concurrent sequential reads) I 
put on the DFS cluster, the higher the chances of blocks getting corrupted. I 
mostly do not see any corruption when there is little or no read contention at 
the DFS servers.

I see that a few other people on the web have also faced the same problem:

http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/508
http://tinyurl.com/7rsckwo

It has been suggested on these threads that faulty hardware may be causing this 
issue, and that such checksum errors are a likely sign of it. So I diagnosed my 
RAM (non-ECC) and HDDs but did not find any problem there. I don't have ECC RAM 
to try with. What makes me more doubtful that hardware is the culprit is that 
the same workloads run fine on the same set of physical machines (and more) 
without causing any block corruption. I also tried creating fresh VMs multiple 
times, but that did not help.

Does anybody have any suggestions on this? I am not sure if the reason is the 
weak VMs, as I see corruption happening only with VMs, and the corruption 
increases as I increase the load on the DFS cluster.

Thanks,
Akshay
