By the way, please don't interpret any of the below as harsh or
critical of HBase. HBase is in a nearly impossible position if
it cannot trust its filesystem.
- Andy
> From: Andrew Purtell <[email protected]>
> Subject: HBASE-1084
> To: [email protected]
> Date: Friday, December 26, 2008, 10:29 AM
> Ran another experiment. Cluster started with 16 regions,
> grew to ~700. Then the HRS serving META went down.
> Eventually the cluster "recovered" ... with 20
> regions. What happened to the other ~680? Gone, from META at
> least. The mapreduce tasks started again and were happy to
> process the only regions remaining. It was stunning. Of
> course with that level of data loss, the results were no
> longer meaningful. I had to do a panic reinitialization so
> now a new experiment is running. I didn't have time to
> look over the logs but my conjecture is there was a file
> level problem during a compaction of META. If it happens
> again this way next time, I will look deeper.
>
> I did try to restart the cluster in an attempt to recover.
> When shutting down, many regionservers threw DFS exceptions
> of the "null datanode[0]" variety. The master was even
> unable to split log files, due to the same type of errors.
> Meanwhile a DFS file writer external to HBase was
> happily creating files and writing blocks with no apparent
> trouble. As far as I can tell, the difference was that it
> was short-lived and recently started.
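
For anyone who wants to reproduce that kind of external write check, a
minimal sketch using only the standard Hadoop FileSystem API might look
like the following (the class name, path, and sizes are illustrative,
not the exact ones used in the experiment):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsWriteProbe {
      public static void main(String[] args) throws Exception {
        // Picks up the cluster's DFS settings from the Hadoop
        // configuration on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write enough data to span at least one DFS block, then
        // read the file status back to confirm the length.
        Path p = new Path("/tmp/dfs-write-probe"); // illustrative path
        byte[] chunk = new byte[64 * 1024];
        FSDataOutputStream out = fs.create(p, true); // overwrite if present
        for (int i = 0; i < 1024; i++) {             // ~64 MB total
          out.write(chunk);
        }
        out.close();
        System.out.println("wrote " + fs.getFileStatus(p).getLen() + " bytes");
        fs.delete(p, true);
      }
    }

If a fresh writer like this succeeds while long-lived regionserver
writers are failing, that points at the difference noted above
(short-lived, recently started) rather than basic DFS availability.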
>
> I am running HBase 0.19 on Hadoop 0.18. Maybe that makes a
> difference, and DFS fixes or whatever between 0.18 and 0.19
> can improve reliability. However, I also think my cluster is
> a laboratory for determining why HBASE-1084 -- and the
> reliability improvements in any code that interacts with the
> FS that are a part of it -- is needed.
>
> So I think the continuous writers scenario has found a new
> victim -- first it was heap, now it is DFS. I seem to be
> able to get up to ~700 regions (from 16) over maybe 8 to 24
> hours before DFS starts taking down HRS. Sometimes recovery
> is fine, but sometimes, as above, the result is a disaster.
> Eventually, somewhere above 1000 regions -- last time it was
> at about 1400 -- unrecoverable file corruption on at least
> one region is inevitable; the probability goes to 1.0.
>
> - Andy