We are working with a small HBase cluster (5 nodes) with fairly beefy nodes. While investigating why all the regionservers died at the same time, we noticed that these servers were reading some files hundreds of times a second. This may not be the cause of the failure, but does that seem odd to you?
HBase version: 0.20.1. The cluster was handling mainly write traffic. Note that the datanode log shows a very large number of reads of these files.

One of the RS logs:
---
2010-04-06 21:51:33,923 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN: campaign,4522\x234\x23201003,1268865840941
2010-04-06 21:51:34,211 INFO org.apache.hadoop.hbase.regionserver.HRegion: region campaign,4522\x234\x23201003,1268865840941/407724784 available; sequence id is 1607026498
2010-04-06 21:51:43,327 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_8972126557191254374_1090962 from any node: java.io.IOException: No live nodes contain current block
2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5586169098563059270_1078171 from any node: java.io.IOException: No live nodes contain current block
2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-7610953303919156937_1089667 from any node: java.io.IOException: No live nodes contain current block
[...]
---
Portion of a grep for one of the blocks mentioned above in the datanode log:
----
39725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020, blockid: blk_8972126557191254374_1090962, duration: 97000
2010-04-06 21:51:43,307 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.0.72:50010, dest: /10.10.0.72:43699, bytes: 6976, op: HDFS_READ, cliID: DFSClient_-1439725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020, blockid: blk_8972126557191254374_1090962, duration: 76000
2010-04-06 21:51:43,310 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.0.72:50010, dest: /10.10.0.72:45123, bytes: 6976, op: HDFS_READ, cliID: DFSClient_-1439725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020, blockid: blk_8972126557191254374_1090962, duration: 93000
2010-04-06 21:51:43,314 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.0.72:50010, dest: /10.10.0.72:41891, bytes: 6976, op: HDFS_READ, cliID: DFSClient_-1439725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020, blockid: blk_8972126557191254374_1090962, duration: 267000
2010-04-06 21:51:43,318 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.0.72:50010, dest: /10.10.0.72:46412, bytes: 6976, op: HDFS_READ, cliID: DFSClient_-1439725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020, blockid: blk_8972126557191254374_1090962, duration: 91000
2010-04-06 21:51:46,330 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.0.72:50010, dest: /10.10.0.72:40657, bytes: 6976, op: HDFS_READ, cliID: DFSClient_-1439725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020, blockid: blk_8972126557191254374_1090962, duration: 85000
------
There are thousands of repeated reads of many small files like this.
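To put a number on the re-read rate, one way is to count HDFS_READ clienttrace entries per (second, block) pair in the datanode log. Below is a sketch; the log path is hypothetical, and two sample entries stand in for the real log so the pipeline is self-contained:

```shell
#!/bin/sh
# Hypothetical path; point this at your actual datanode log.
log=sample-datanode.log

# Two sample clienttrace entries stand in for the real log here.
cat > "$log" <<'EOF'
2010-04-06 21:51:43,307 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.0.72:50010, dest: /10.10.0.72:43699, bytes: 6976, op: HDFS_READ, cliID: DFSClient_-1439725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020, blockid: blk_8972126557191254374_1090962, duration: 76000
2010-04-06 21:51:43,310 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.0.72:50010, dest: /10.10.0.72:45123, bytes: 6976, op: HDFS_READ, cliID: DFSClient_-1439725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020, blockid: blk_8972126557191254374_1090962, duration: 93000
EOF

# Extract (timestamp truncated to the second, blockid) pairs and count
# duplicates; a high count for one block within one second confirms the
# tight re-read loop.
grep 'op: HDFS_READ' "$log" \
  | sed -n 's/^\([0-9-]* [0-9:]*\),[0-9]*.*blockid: \(blk_[-0-9]*_[0-9]*\).*/\1 \2/p' \
  | sort | uniq -c | sort -rn
```

On a real log this prints the hottest (second, block) pairs first, so the blocks being hammered stand out immediately.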
From the NN log, this block was created for /hbase/.META./1028785192/info/1728561479703335912:
----
2010-04-06 21:51:20,906 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.META./1028785192/info/1728561479703335912. blk_8972126557191254374_1090962
----
Btw, we had replication mistakenly set to 1 for this file.

Thanks for taking a look.
Raghu.
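For reference, the replication factor on that file can be inspected and restored from the shell; a sketch with the standard `hadoop fs` commands, using the path from the NN log entry above (the target of 3 replicas is an assumption about the cluster default):

```shell
# Print the current replication factor (%r) of the .META. store file.
hadoop fs -stat %r /hbase/.META./1028785192/info/1728561479703335912

# Raise it back to 3 replicas for this file; with -R the same setting
# can be applied recursively to the whole .META. region directory.
hadoop fs -setrep 3 /hbase/.META./1028785192/info/1728561479703335912
hadoop fs -setrep -R 3 /hbase/.META./1028785192
```

With a single replica, the loss or overload of one datanode is enough to produce the "No live nodes contain current block" errors seen in the RS log.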
