I think the determining factor in choosing HBase over raw HDFS files is really the consumption pattern. If you're only ever going to process the data in bulk, then chances are you'll get the most performance out of a raw HDFS file. However, if you need random access to some of the entries, then HBase will give you a significant benefit.
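To make the contrast concrete, here's a toy sketch in plain Python (not the actual HBase or HDFS APIs; the names and data are made up for illustration). A flat file forces you to scan everything, while a keyed store lets you jump straight to one row:

```python
# Illustrative only: in-memory stand-ins for a flat HDFS file (a list you
# must scan) and an HBase-style keyed table (a dict you can probe by row key).

def bulk_scan(records):
    """Flat-file style: touch every record sequentially (great for MapReduce)."""
    return sum(value for _key, value in records)

def random_access(index, key):
    """HBase style: jump straight to one row by key, no full scan needed."""
    return index[key]

records = [("row-%03d" % i, i) for i in range(100)]
index = dict(records)  # stands in for HBase's key-ordered row lookup

total = bulk_scan(records)             # reads all 100 records
one = random_access(index, "row-042")  # reads exactly one
```

If your workload looks like `bulk_scan`, a raw file wins; if it looks like `random_access`, you want the indexed store.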
There are other factors that go into this decision. One that comes to mind off the top of my head is whether you'd like to take advantage of HBase's versioning and semi-defined schema for your dataset. It would be a little complicated to duplicate all of that logic on your own on top of a flat file.
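Here's a rough sketch of what "duplicating that logic yourself" would mean: a toy versioned cell store in plain Python (again, not the real HBase API; class and column names are invented), keeping the last N timestamped values per (row, column) cell the way HBase does out of the box:

```python
# Illustrative only: a minimal versioned cell store mimicking the bookkeeping
# you'd have to rebuild on top of flat files. All names here are made up.
import time
from collections import defaultdict

class VersionedTable:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, column, value, ts=None):
        versions = self.cells[(row, column)]
        versions.insert(0, (ts if ts is not None else time.time(), value))
        del versions[self.max_versions:]  # trim to the newest N versions

    def get(self, row, column):
        """Return the newest value, or None if the cell was never written."""
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None

t = VersionedTable(max_versions=2)
t.put("user1", "info:email", "old@example.com", ts=1)
t.put("user1", "info:email", "new@example.com", ts=2)
```

Note the "semi-defined schema" falls out for free: any (row, column) pair can be written without declaring columns up front, which is exactly what's painful to retrofit onto a flat file.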
Another factor is your system's workflow. If you use HDFS files, you need to be OK with rewriting the files to do any "updates". So even if you only add 1MB worth of new data to a 1TB dataset, you have to rewrite the whole thing. HBase would let you "insert" it where it belongs. (Of course, HBase operates under the same HDFS constraints your applications do; it's just that we've already done the work of managing random inserts.)
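A toy sketch of that difference in plain Python (illustrative only; the record sizes are tiny stand-ins for the 1TB / 1MB numbers above, and nothing here is a real HBase or HDFS call):

```python
# Illustrative only: the "flat file" path rebuilds the whole dataset to add
# one record; the "table" path inserts in place into a sorted structure.
import bisect

def update_flat_file(old_records, new_record):
    """HDFS-file style: produce a complete new copy with the record added."""
    return sorted(old_records + [new_record])  # rewrites everything

def insert_in_table(sorted_records, new_record):
    """HBase style: place the record where it belongs, touching little else."""
    bisect.insort(sorted_records, new_record)
    return sorted_records

data = [("k%02d" % i, i) for i in range(0, 10, 2)]  # 5 existing records
rewritten = update_flat_file(data, ("k03", 3))      # copies all 5 + 1
insert_in_table(data, ("k03", 3))                   # mutates data in place
```

The cost of `update_flat_file` scales with the whole dataset; the cost of `insert_in_table` scales with the one record being added, which is the essence of the tradeoff described above.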
Does this help you out?
-Bryan
On May 13, 2008, at 10:13 AM, Naama Kraus wrote:
Hi,
Can anyone say a few words on when to use HBase as opposed to using plain MapReduce on input files?

In more detail: when would it make sense to put data into HBase and then use HBase methods to access it, including running MapReduce on the data in the tables, as opposed to simply putting the data into HDFS and processing it with MapReduce?

Thanks, Naama
On Wed, Mar 12, 2008 at 12:15 AM, Bryan Duxbury <[EMAIL PROTECTED]>
wrote:
I've written up a blog post discussing when I think it's appropriate to use HBase, in response to some of the questions people usually ask. You can find it at http://blog.rapleaf.com/dev/?p=26.
-Bryan
--
"If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales." (Albert Einstein)