We are working with a small HBase cluster (5 nodes) with fairly beefy nodes.
While investigating why all the region servers died at the same time, we
noticed that these servers were reading some files hundreds of times a second.
This may not be the cause of the failure... but does this seem odd to you?

HBase version: 0.20.1. The cluster was handling mainly write traffic.
Note that the datanode log shows a lot of reads of these files.

From one of the RS logs:
 ---
2010-04-06 21:51:33,923 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Worker: MSG_REGION_OPEN:
campaign,4522\x234\x23201003,1268865840941
2010-04-06 21:51:34,211 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region campaign,4522\x234\x23201003,1268865840941/407724784 available;
sequence id is 1607026498
2010-04-06 21:51:43,327 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_8972126557191254374_1090962 from any node:
 java.io.IOException: No live nodes contain current block
2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-5586169098563059270_1078171 from any node:
 java.io.IOException: No live nodes contain current block
2010-04-06 21:51:43,328 INFO org.apache.hadoop.hdfs.DFSClient: Could not
obtain block blk_-7610953303919156937_1089667 from any node:
 java.io.IOException: No live nodes contain current block
[...]
----

Portion of a grep for one of the blocks mentioned above in the datanode log:
----
39725703, offset: 0, srvID: DS-977430382-10.10.0.72-50010-1266601998020,
blockid: blk_8972126557191254374_1090962, duration: 97000
2010-04-06 21:51:43,307 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
10.10.0.72:50010, dest: /10.10.0.72:43699, bytes: 6976, op: HDFS_READ,
cliID: DFSClient_-1439725703, offset: 0, srvID:
DS-977430382-10.10.0.72-50010-1266601998020, blockid:
blk_8972126557191254374_1090962, duration: 76000
2010-04-06 21:51:43,310 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
10.10.0.72:50010, dest: /10.10.0.72:45123, bytes: 6976, op: HDFS_READ,
cliID: DFSClient_-1439725703, offset: 0, srvID:
DS-977430382-10.10.0.72-50010-1266601998020, blockid:
blk_8972126557191254374_1090962, duration: 93000
2010-04-06 21:51:43,314 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
10.10.0.72:50010, dest: /10.10.0.72:41891, bytes: 6976, op: HDFS_READ,
cliID: DFSClient_-1439725703, offset: 0, srvID:
DS-977430382-10.10.0.72-50010-1266601998020, blockid:
blk_8972126557191254374_1090962, duration: 267000
2010-04-06 21:51:43,318 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
10.10.0.72:50010, dest: /10.10.0.72:46412, bytes: 6976, op: HDFS_READ,
cliID: DFSClient_-1439725703, offset: 0, srvID:
DS-977430382-10.10.0.72-50010-1266601998020, blockid:
blk_8972126557191254374_1090962, duration: 91000
2010-04-06 21:51:46,330 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
10.10.0.72:50010, dest: /10.10.0.72:40657, bytes: 6976, op: HDFS_READ,
cliID: DFSClient_-1439725703, offset: 0, srvID:
DS-977430382-10.10.0.72-50010-1266601998020, blockid:
blk_8972126557191254374_1090962, duration: 85000
------
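In case it helps anyone reproduce the tally, here is a rough sketch of how the repeated reads can be counted per block from a datanode clienttrace log. The sample lines below are stand-ins for the real log (paths and the trimmed-down line format are assumptions), but the pipeline is the same:

```shell
# Stand-in sample for the datanode clienttrace log (format abbreviated).
cat > /tmp/datanode_sample.log <<'EOF'
2010-04-06 21:51:43,307 INFO clienttrace: op: HDFS_READ, blockid: blk_8972126557191254374_1090962, duration: 76000
2010-04-06 21:51:43,310 INFO clienttrace: op: HDFS_READ, blockid: blk_8972126557191254374_1090962, duration: 93000
2010-04-06 21:51:43,314 INFO clienttrace: op: HDFS_READ, blockid: blk_-5586169098563059270_1078171, duration: 91000
EOF

# Count HDFS_READ operations per block, most-read blocks first.
grep 'op: HDFS_READ' /tmp/datanode_sample.log \
  | grep -o 'blockid: blk_[-0-9]*_[0-9]*' \
  | sort | uniq -c | sort -rn
```

Pointing the same pipeline at the real datanode log is how one can see which blocks are being hammered.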

There are thousands of repeated reads of many small files like this.

--- From the NN log, this block was created
for /hbase/.META./1028785192/info/1728561479703335912
2010-04-06 21:51:20,906 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock: /hbase/.META./1028785192/info/1728561479703335912.
blk_8972126557191254374_1090962
----

Btw, we had replication set to 1 for this file by mistake.
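For anyone hitting the same thing, a rough sketch of checking and fixing the replication with the stock Hadoop 0.20 tools (the path is taken from the NN log above; the target replication of 3 is an assumption, use whatever your cluster default is):

```shell
# Show the file's blocks, their locations, and current replication
# (path from the NN log excerpt; adjust for your layout).
hadoop fsck /hbase/.META. -files -blocks -locations

# Recursively raise replication back to 3 (assumed default)
# for everything under /hbase.
hadoop fs -setrep -R 3 /hbase
```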

thanks for taking a look.
Raghu.
