Andrew Purtell wrote:
Ran another experiment. Cluster started with 16 regions, grew to ~700. Then the HRS serving META went down. Eventually the cluster "recovered" ... with 20 regions. What happened to the other ~680? Gone, from META at least.

Were they still present on the filesystem, Andrew?

The mapreduce tasks started again and were happy to process the only regions remaining. It was stunning. Of course, with that level of data loss the results were no longer meaningful. I had to do a panic reinitialization, so a new experiment is now running. I didn't have time to look over the logs, but my conjecture is that there was a file-level problem during a compaction of META. If it happens this way again, I will look deeper next time.

OK. Let me help out. Let's check the datanode logs too for OOMEs, for "xceiverCount X exceeds the limit of concurrent xcievers Y", or for any other complaint that would give us a clue as to why it is fragile at 1000+ regions.
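
If it does turn out to be the xceiver limit, the usual fix is to raise it. This is just a sketch, assuming your Hadoop build exposes the limit via the dfs.datanode.max.xcievers property (the name keeps Hadoop's own misspelling) -- I am not sure off-hand which 0.18.x releases made it configurable -- and the value below is only a guess at something comfortably above the default. Set it in hadoop-site.xml on each datanode and restart DFS:

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2047</value>
  </property>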

Losing that many edits to .META. "shouldn't" happen; we should be flushing the commit log so that even if a fat memcache flush fails, we'll have the commit log to replay. The cluster may instead have fallen into one of the other known 'holes', such as the one where the master will not split logs during shutdown.
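
To make concrete what I mean by the commit log saving us, here is a rough, self-contained sketch of the write-ahead pattern -- illustration only, not HBase's actual HLog/flush code, and the class and method names (ToyWal, append, replay) are made up:

import java.io.*;
import java.util.*;

// Toy write-ahead log: every edit is synced to disk before it is applied
// to the in-memory cache, so a failed cache flush can be recovered by
// replaying the log.
class ToyWal {
    private final File logFile;

    ToyWal(File logFile) {
        this.logFile = logFile;
    }

    // Append the edit and force it to disk before the caller touches
    // the in-memory cache.
    synchronized void append(String key, String value) throws IOException {
        FileOutputStream out = new FileOutputStream(logFile, true);
        try {
            out.write((key + "\t" + value + "\n").getBytes("UTF-8"));
            out.getFD().sync();   // the edit is durable from this point on
        } finally {
            out.close();
        }
    }

    // Rebuild the in-memory cache from the log after a crash or a failed
    // flush; this is what log splitting/replay buys us.
    SortedMap<String, String> replay() throws IOException {
        SortedMap<String, String> cache = new TreeMap<String, String>();
        if (!logFile.exists()) {
            return cache;
        }
        BufferedReader in = new BufferedReader(new FileReader(logFile));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv.length == 2) {
                    cache.put(kv[0], kv[1]);
                }
            }
        } finally {
            in.close();
        }
        return cache;
    }
}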


I did try to restart the cluster in an attempt to recover. When shutting down, many regionservers threw DFS exceptions of the "null datanode[0]" variety. The master was even unable to split log files, due to the same type of errors. Meanwhile, a DFS file writer external to HBase was happily creating files and writing blocks with no apparent trouble. As far as I can tell, the difference was that it was short-lived and recently started.
Ok.

I am running HBase 0.19 on Hadoop 0.18. Maybe that makes a difference, and the DFS fixes (or whatever else changed) between 0.18 and 0.19 would improve reliability.

Perhaps.

However, I also think my cluster is a laboratory for determining why HBASE-1084 -- and the reliability improvements it brings to any code that interacts with the FS -- is needed. So I think the continuous-writers scenario has found a new victim -- first it was heap, now it is DFS. I seem to be able to get up to ~700 regions (from 16) over maybe 8 to 24 hours before DFS starts taking down HRS. Sometimes recovery is fine, but sometimes, as above, the result is a disaster. Eventually, somewhere above 1000 regions -- last time it was at about 1400 -- unrecoverable file corruption on at least one region becomes inevitable; the probability goes to 1.0.

Let me take a look at 1084.

St.Ack
