While running the test on this cluster of 14 servers, the highest loads
I see are 3.68 (0.0% wa) on the master node and 2.65 (3.4% wa) on the
node serving the .META. region.  All the machines are on a single
gigabit switch dedicated to the cluster.  The highest throughput between
nodes has been 21.4 MB/s Rx, on the node hosting the .META. region.
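
In case anyone wants to reproduce these measurements, here is a minimal
Python sketch that samples the same numbers (1-minute load, iowait, and
Rx throughput) from the Linux /proc interface.  The eth0 name and the
five-second window are placeholders, not necessarily what I ran:

#!/usr/bin/env python
# Sample 1-minute load average, iowait %, and Rx throughput on one
# interface from /proc.  IFACE and the sleep window are placeholders.
import time

IFACE = "eth0"  # placeholder; use the cluster-facing interface

def cpu_times():
    # /proc/stat first line: "cpu user nice system idle iowait irq ..."
    vals = [int(v) for v in open("/proc/stat").readline().split()[1:]]
    return sum(vals), vals[4]          # (total jiffies, iowait jiffies)

def rx_bytes(iface):
    for line in open("/proc/net/dev"):
        if line.strip().startswith(iface + ":"):
            return int(line.split(":")[1].split()[0])
    raise ValueError("interface %s not found" % iface)

total0, wa0 = cpu_times()
rx0 = rx_bytes(IFACE)
time.sleep(5)
total1, wa1 = cpu_times()
rx1 = rx_bytes(IFACE)

load1m = open("/proc/loadavg").read().split()[0]
print("load(1m)=%s  wa=%.1f%%  rx=%.1f MB/s" % (
    load1m,
    100.0 * (wa1 - wa0) / max(total1 - total0, 1),
    (rx1 - rx0) / 5.0 / (1024 * 1024)))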

There are 239 "Block blk_-xxx is not valid errors", 522 "BlockInfo not
found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
in the hadoop logs over 12 hours of running the test.
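
For anyone who wants to repeat the count, a minimal Python sketch of
the tally follows.  The log path is a placeholder for wherever your
Hadoop logs actually live:

#!/usr/bin/env python
# Count occurrences of the three error signatures across the Hadoop
# logs.  LOG_GLOB is a placeholder; point it at your actual log files.
import glob

LOG_GLOB = "/var/log/hadoop/*.log*"    # placeholder path

SIGNATURES = [
    "is not valid",                    # "Block blk_-xxx is not valid"
    "BlockInfo not found in volumeMap",
    "BlockAlreadyExistsException",
]

counts = dict((s, 0) for s in SIGNATURES)
for path in glob.glob(LOG_GLOB):
    for line in open(path):
        for sig in SIGNATURES:
            if sig in line:
                counts[sig] += 1

for sig in SIGNATURES:
    print("%6d  %s" % (counts[sig], sig))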

I understand that I am loading the cluster - that is the point of the
test - but I don't think this should result in data loss.  Failed
inserts at the client level I can handle, but loss of data that was
previously thought to be safely stored in HBase is a major issue.  Are
there plans to make HBase more resilient to load-based failures?

Regards,
elsif

Andrew Purtell wrote:
> The reason JG points to load as being the problem is that all signs point to it: 
> This is usually the culprit behind DFS "no live block" errors -- the namenode 
> is too busy and/or falling behind, or the datanodes are falling behind, or 
> actually failing. Also, in the log snippets you provide, HBase is complaining 
> about writes to DFS (for the WAL) taking in excess of 2 seconds. Also highly 
> indicative of load, write load. Shortly after this, ZooKeeper sessions begin 
> expiring, which is also usually indicative of overloading -- heartbeats miss 
> their deadline. 
>
> When I see these signs on my test clusters, I/O wait is generally in excess 
> of 40%. 
>
> If your total CPU load is really just a few % (user + system + iowait), then 
> I'd suggest you look at the storage layer. Is there anything in the datanode 
> logs that seems like it might be relevant?
>
> What about the network? Gigabit? Any potential sources of contention? Are you 
> tracking network utilization metrics during the test?
>
> Also, you might consider using Ganglia to monitor and correlate system 
> metrics and HBase and HDFS metrics during your testing, if you are not doing 
> this already. 
>
>    - Andy
