> 239 "Block blk_-xxx is not valid errors", > 522 "BlockInfo not found in volumeMap" errors, > and 208 "BlockAlreadyExistsException"
I assume since you say they were found in Hadoop logs that these appeared in the datanode and/or namenode logs. If not, and these are instead from HBase logs, please correct my understanding.

It seems to me that your HDFS is sick. That's not particularly helpful, I know, but HBase is a client application of HDFS and depends on its good functioning. Have you talked with anyone on, or mailed the logs to, [email protected]? If so, what did they say?

> Are there plans to make hbase more resilient to load based failures?

Yes, this is definitely something we have done and continue to do. It's hard if you can't trust your filesystem.

Once I tried running HBase on top of KFS instead of HDFS. KFS did seem slower, as is the conventional wisdom, but I had bigger problems... the chunkservers would randomly abort on my x86_64 nodes, and even after I gave Sriram system access for gdb stack dumps there was no clear resolution. On the other hand, if you get it working, it has working sync and append. HDFS won't have a fully working sync until 0.21. YMMV.

   - Andy

________________________________
From: elsif <[email protected]>
To: [email protected]
Sent: Wed, October 21, 2009 8:16:40 AM
Subject: Re: HBase Exceptions on version 0.20.1

While running the test on this cluster of 14 servers, the highest loads I see are 3.68 (0.0% wa) on the master node and 2.65 (3.4% wa) on the node serving the .META. region. All the machines are on a single gigabit switch dedicated to the cluster. The highest throughput between nodes has been 21.4 MBps Rx on the node hosting the .META. region.

There are 239 "Block blk_-xxx is not valid" errors, 522 "BlockInfo not found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found in the Hadoop logs over 12 hours of running the test.

I understand that I am loading the cluster - that is the point of the test, but I don't think that this should result in data loss. Failed inserts at the client level I can handle, but loss of data that was previously thought to be stored in HBase is a major issue.

Are there plans to make hbase more resilient to load based failures?

Regards,
elsif

Andrew Purtell wrote:
> The reason JG points to load as being a problem is that all signs point to it:
> this is usually the culprit behind DFS "no live block" errors -- the namenode
> is too busy and/or falling behind, or the datanodes are falling behind, or
> actually failing. Also, in the log snippets you provide, HBase is complaining
> about writes to DFS (for the WAL) taking in excess of 2 seconds. That is also
> highly indicative of load, write load. Shortly after this, ZooKeeper sessions
> begin expiring, which is also usually indicative of overloading -- heartbeats
> miss their deadline.
>
> When I see these signs on my test clusters, I/O wait is generally in excess
> of 40%.
>
> If your total CPU load is really just a few % (user + system + iowait), then
> I'd suggest you look at the storage layer. Is there anything in the datanode
> logs that seems like it might be relevant?
>
> What about the network? Gigabit? Any potential sources of contention? Are you
> tracking network utilization metrics during the test?
>
> Also, you might consider using Ganglia to monitor and correlate system
> metrics and HBase and HDFS metrics during your testing, if you are not doing
> this already.
>
> - Andy
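
For reference, a minimal sketch of one way to tally error signatures like the ones quoted above across the Hadoop datanode/namenode logs, assuming plain-text log files. The class name and file handling are illustrative, not something used in this thread; the matched substrings are shortened forms of the errors elsif reported.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Tally occurrences of a few HDFS error signatures across the log files
    // passed on the command line (e.g. the datanode logs).
    public class DfsErrorCounter {
        public static void main(String[] args) throws IOException {
            String[] patterns = {
                "is not valid",               // "Block blk_-xxx is not valid"
                "not found in volumeMap",     // "BlockInfo not found in volumeMap"
                "BlockAlreadyExistsException"
            };
            Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
            for (String p : patterns) {
                counts.put(p, 0);
            }
            for (String path : args) {
                BufferedReader reader = new BufferedReader(new FileReader(path));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        for (String p : patterns) {
                            if (line.contains(p)) {
                                counts.put(p, counts.get(p) + 1);
                            }
                        }
                    }
                } finally {
                    reader.close();
                }
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }

Compile it and point it at the logs, e.g. java DfsErrorCounter /var/log/hadoop/*datanode*.log* (the log location is an assumption and varies by installation).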
