> From: g00dn3ss <[email protected]>
[...]
> I guess it's missing some important file that I deleted
> when doing my fsck. I guess HBase also has problems if
> either of the data or index files is missing for a MapFile?
If the data file for a MapFile is gone, recovery is not
possible. If the index file is missing, it should be
regenerated on deployment. There may be a bug that prevents
this in some cases, but I am confident it will be resolved
very soon.
> I have a more general question about the HBase
> architecture. It seems like HBase is deleting and
> rewriting large portions of the table's data. This
> seems to introduce a reliability concern that multiplies
> any concerns about the reliability of the DFS itself.
One of the fundamental insights underlying MapReduce and
the Bigtable clones such as Hadoop and HBase -- (Hadoop DFS
can be considered a very loosely structured database) -- is
that on modern hardware, seek times dominate when updating
very large data sets. It is much more efficient to log
updates and then periodically merge them with the existing
data -- rewriting the whole database -- than it is to use
an index such as a B-tree and seek all over the disk
attempting to apply the same scale of updates as a set of
point writes.
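The log-then-merge idea can be sketched in a few lines. This
is only an illustration of the compaction pattern, not
HBase's actual implementation; the function and variable
names are invented for the example.

```python
# Minimal sketch of log-then-merge (compaction): updates are
# appended to a log structure, and a periodic compaction merges
# them into the sorted base data with one sequential rewrite,
# instead of seeking to each record individually.

def compact(base, log):
    """Merge logged updates into the sorted base data.

    base: list of (key, value) pairs sorted by key.
    log:  dict of key -> value with the latest update per key.
    Returns a new sorted base incorporating every update.
    """
    merged = dict(base)            # start from the existing data
    merged.update(log)             # later (logged) updates win
    return sorted(merged.items())  # one sequential pass/rewrite

base = [("a", 1), ("b", 2), ("c", 3)]
log = {"b": 20, "d": 4}   # updates accumulated since last compaction
print(compact(base, log))  # [('a', 1), ('b', 20), ('c', 3), ('d', 4)]
```

The point is that the merge touches the data in one sequential
sweep, which is exactly the access pattern disks are fast at.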
Doug Cutting has a set of slides showing an example use
case where only 1 day is required to fully update a
database by compaction/rewrite, while a traditional RDBMS
would instead require 1,000 days to apply the same updates
via seek and replace.
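A quick back-of-the-envelope calculation shows where that
kind of gap comes from. The disk figures below (10 ms per
seek, 100 MB/s sequential transfer) are my own illustrative
assumptions, not the numbers from Doug's slides.

```python
# Why seeks dominate: updating 10% of a 1 TB database of
# 100-byte records, comparing one seek per record against a
# single sequential rewrite of the whole database.

SEEK_TIME = 0.010        # seconds per random seek (assumption)
XFER_RATE = 100 * 10**6  # bytes/sec sequential transfer (assumption)

db_size = 10**12                         # 1 TB
record_size = 100                        # bytes per record
updates = db_size // record_size // 10   # update 10% of records

# Point writes: one seek per updated record.
seek_days = updates * SEEK_TIME / 86400

# Log + compaction: one sequential rewrite of everything.
rewrite_days = db_size / XFER_RATE / 86400

print(f"point writes:  {seek_days:.0f} days")    # ~116 days
print(f"full rewrite:  {rewrite_days:.2f} days") # ~0.12 days
```

Even though the rewrite moves ten times more bytes than the
updates strictly require, it finishes roughly a thousand
times sooner, which matches the order of magnitude in the
slides.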
Hope this helps,
- Andy