While running the test on this cluster of 14 servers, the highest loads I see are 3.68 (0.0% wa) on the master node and 2.65 (3.4% wa) on the node serving the .META. region. All the machines are on a single gigabit switch dedicated to the cluster. The highest throughput between nodes has been 21.4 MB/s Rx on the node hosting the .META. region.
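
For anyone who wants to watch the same numbers during their own runs, a minimal Python sketch of one way to sample them per node (1-minute load average, iowait percentage, and NIC Rx throughput). The interface name and sample interval are placeholders, so adjust for your hardware; this is only an illustration, not a claim about how the figures above were collected.

import time

IFACE = "eth0"      # assumption: change to the NIC carrying cluster traffic
INTERVAL = 5        # seconds between samples

def cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq ..."
    vals = [int(x) for x in open("/proc/stat").readline().split()[1:]]
    return sum(vals), vals[4]              # total jiffies, iowait jiffies

def rx_bytes(iface):
    # /proc/net/dev lines look like "  eth0: <rx_bytes> <rx_packets> ..."
    for line in open("/proc/net/dev"):
        if line.strip().startswith(iface + ":"):
            return int(line.split(":")[1].split()[0])
    return 0

prev_total, prev_wait = cpu_times()
prev_rx = rx_bytes(IFACE)
while True:
    time.sleep(INTERVAL)
    total, wait = cpu_times()
    rx = rx_bytes(IFACE)
    load1 = open("/proc/loadavg").read().split()[0]
    wa_pct = 100.0 * (wait - prev_wait) / max(1, total - prev_total)
    rx_mb_s = (rx - prev_rx) / float(INTERVAL) / (1024 * 1024)
    print("load %s  %.1f%% wa  %.2f MB/s Rx" % (load1, wa_pct, rx_mb_s))
    prev_total, prev_wait, prev_rx = total, wait, rx

Running one copy per node and collecting the output makes it easier to line the load and network peaks up against the client-side failures.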
There are 239 "Block blk_-xxx is not valid" errors, 522 "BlockInfo not found in volumeMap" errors, and 208 "BlockAlreadyExistsException" errors in the Hadoop logs over 12 hours of running the test. I understand that I am loading the cluster - that is the point of the test - but I don't think that this should result in data loss. Failed inserts at the client level I can handle, but loss of data that was previously thought to be stored in HBase is a major issue. Are there plans to make HBase more resilient to load-based failures?

Regards,
elsif

Andrew Purtell wrote:
> The reason JG points to load as being the problem is that all signs point to
> it. This is usually the culprit behind DFS "no live block" errors -- the
> namenode is too busy and/or falling behind, or the datanodes are falling
> behind, or actually failing. Also, in the log snippets you provide, HBase is
> complaining about writes to DFS (for the WAL) taking in excess of 2 seconds.
> That is also highly indicative of load, specifically write load. Shortly
> after this, ZooKeeper sessions begin expiring, which is also usually
> indicative of overloading -- heartbeats miss their deadline.
>
> When I see these signs on my test clusters, I/O wait is generally in excess
> of 40%.
>
> If your total CPU load is really just a few % (user + system + iowait), then
> I'd suggest you look at the storage layer. Is there anything in the datanode
> logs that seems like it might be relevant?
>
> What about the network? Gigabit? Any potential sources of contention? Are you
> tracking network utilization metrics during the test?
>
> Also, you might consider using Ganglia to monitor and correlate system
> metrics and HBase and HDFS metrics during your testing, if you are not doing
> this already.
>
> - Andy
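
P.S. For anyone who wants to pull similar error counts out of their own DataNode logs, a rough Python sketch is below. The log path glob and the message substrings are assumptions and will likely need adjusting for your Hadoop version and log4j layout; there is no timestamp filtering, so point it at the files covering the window of interest. The same approach works against the RegionServer logs if you want to count the slow WAL write warnings Andy mentions.

import glob

LOG_GLOB = "/var/log/hadoop/hadoop-*-datanode-*.log*"   # assumption: adjust to your log dir
PATTERNS = [
    "is not valid",                    # "Block blk_... is not valid"
    "not found in volumeMap",          # "BlockInfo not found in volumeMap"
    "BlockAlreadyExistsException",
]

counts = {p: 0 for p in PATTERNS}
for path in glob.glob(LOG_GLOB):
    # Plain-text rollovers only; compressed (.gz) rollovers would need gzip handling.
    with open(path, errors="replace") as log:
        for line in log:
            for p in PATTERNS:
                if p in line:
                    counts[p] += 1

for p in PATTERNS:
    print("%6d  %s" % (counts[p], p))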
