Ran another experiment. Cluster started with 16 regions, grew to ~700. Then the
HRS serving META went down. Eventually the cluster "recovered" ... with 20
regions. What happened to the other ~680? Gone, from META at least. The
mapreduce tasks started again and were happy to process the only regions
remaining. It was stunning. Of course with that level of data loss, the results
were no longer meaningful. I had to do a panic reinitialization, so now a new
experiment is running. I didn't have time to look over the logs, but my
conjecture is there was a file-level problem during a compaction of META. If it
happens this way again next time, I will look deeper.
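For what it's worth, a quick way to confirm how many regions META actually knows
about is to scan the catalog table and count rows. The sketch below is against a
much newer HBase client API than the 0.19 one here (Connection/Table and the
hbase:meta name didn't exist back then), so take it as illustrative only:

  // Illustrative only: counts rows in the catalog table to see how many
  // regions the cluster thinks it has. Modern client API; the 0.19-era
  // classes and the ".META." table name differ.
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.*;

  public class CountMetaRegions {
    public static void main(String[] args) throws Exception {
      try (Connection conn =
               ConnectionFactory.createConnection(HBaseConfiguration.create());
           Table meta = conn.getTable(TableName.valueOf("hbase:meta"));
           ResultScanner scanner = meta.getScanner(new Scan())) {
        long rows = 0;
        for (Result r : scanner) {
          rows++;                      // roughly one row per region entry
        }
        System.out.println("META rows (~ regions): " + rows);
      }
    }
  }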
I did try to restart the cluster in an attempt to recover. When shutting down,
many regionservers threw DFS exceptions of the "null datanode[0]" variety. The
master was even unable to split log files, due to the same type of errors.
Meanwhile a DFS file writer external to HBase was happily creating files and
writing blocks with no apparent trouble. As far as I can tell, the difference
was that it was short-lived and recently started.
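By "DFS file writer external to HBase" I mean something along the lines of the
sketch below: a standalone client writing through the plain Hadoop FileSystem
API (the path and sizes are made up for illustration, not the exact tool I ran):

  // Rough sketch of a standalone DFS writer: creates a file via the
  // Hadoop FileSystem API and streams enough data to cross a block boundary.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DfsCanaryWriter {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up the cluster config
      FileSystem fs = FileSystem.get(conf);
      Path p = new Path("/tmp/dfs-canary-" + System.currentTimeMillis()); // hypothetical path
      byte[] chunk = new byte[64 * 1024];
      FSDataOutputStream out = fs.create(p);
      try {
        // write ~128MB so at least one full block gets allocated and closed
        for (int i = 0; i < 2048; i++) {
          out.write(chunk);
        }
      } finally {
        out.close();
      }
      System.out.println("wrote " + fs.getFileStatus(p).getLen() + " bytes to " + p);
    }
  }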
I am running HBase 0.19 on Hadoop 0.18. Maybe that makes a difference, and DFS
fixes or whatever between 0.18 and 0.19 can improve reliability. However, I also
think my cluster is a laboratory for demonstrating why HBASE-1084 -- and the
reliability improvements it brings to any code that interacts with the FS -- is
needed.
So I think the continuous writers scenario has found a new victim -- first it
was heap, now it is DFS. I seem to be able to get up to ~700 regions (from 16)
over maybe 8 to 24 hours before DFS starts taking down HRS. Sometimes recovery
is fine, but sometimes, as above, the result is a disaster. Eventually, somewhere
above 1000 regions -- last time it was at about 1400 -- unrecoverable file
corruption on at least one region becomes inevitable; the probability goes to 1.0.
- Andy