Re: HBase and Web-Scale BI

Ryan Rawson Thu, 26 Feb 2009 13:52:54 -0800

I may have misspoken somewhat - hbase is actually quite good at random
reads.  But the catch is, it can only randomly read via the row id.  It's
more or less akin to having a DB table with only an indexed primary key, and
no secondary indexes.
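
To make that concrete, here is a minimal sketch that models a table as a
sorted map and uses the row key layout quoted further down in this thread.
It is not the HBase client API: the class name, the "/" separator, and the
zero-padding are assumptions made for illustration, and a java.util.TreeMap
stands in for the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a table: rows kept sorted by row key.
public class RowKeySketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Builds "<user>/<Long.MAX_VALUE - timestamp>/<event id>".
    // The reversed timestamp is zero-padded to 19 digits so that
    // lexicographic order matches numeric order (assumes timestamp >= 0).
    public static String eventKey(String user, long timestampMillis, String eventId) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return String.format("%s/%019d/%s", user, reversed, eventId);
    }

    public void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Random read: only possible when the full row key is known.
    public String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Range scan: walk the sorted keys starting at the prefix and stop
    // at the first key that no longer matches it.
    public List<String> scanPrefix(String prefix, int limit) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix) || values.size() >= limit) {
                break;
            }
            values.add(e.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        RowKeySketch table = new RowKeySketch();
        table.put(eventKey("alice", 1000L, "e1"), "login");
        table.put(eventKey("alice", 2000L, "e2"), "post");
        table.put(eventKey("bob", 1500L, "e3"), "login");

        System.out.println(table.get(eventKey("alice", 1000L, "e1"))); // login
        System.out.println(table.scanPrefix("alice/", 10)); // [post, login]
    }
}
```

The scan returns the newer event first because a larger timestamp gives a
smaller reversed value.  Anything that is not a prefix of the row key, such
as "all logins", cannot be answered this way without reading every row.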


So, yes, random reads and "index scans" work, and work well.  You just have
to handle the index creation and maintenance yourself.
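
Roughly, "doing it yourself" could look like the sketch below: a second
sorted table whose row key starts with the value you want to look up by, kept
in step with the main table on every write.  Again this is only a model using
java.util.TreeMap, not HBase code; the class, method, and table names are made
up, and it ignores what happens if the second write fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a main table plus a hand-maintained index table.
public class SecondaryIndexSketch {
    // Main "table": post row key -> author (a single column, for brevity).
    private final TreeMap<String, String> posts = new TreeMap<>();
    // Index "table": "<author>/<post row key>" -> empty value.
    private final TreeMap<String, String> postsByAuthor = new TreeMap<>();

    private static String indexKey(String author, String rowKey) {
        return author + "/" + rowKey;
    }

    // Every write touches both tables; a stale index entry is removed
    // when the author of an existing post changes.
    public void putPost(String rowKey, String author) {
        String previous = posts.put(rowKey, author);
        if (previous != null) {
            postsByAuthor.remove(indexKey(previous, rowKey));
        }
        postsByAuthor.put(indexKey(author, rowKey), "");
    }

    public void deletePost(String rowKey) {
        String author = posts.remove(rowKey);
        if (author != null) {
            postsByAuthor.remove(indexKey(author, rowKey));
        }
    }

    // "Index scan": a prefix scan over the index table yields the row
    // keys to fetch from the main table.
    public List<String> rowKeysForAuthor(String author) {
        String prefix = author + "/";
        List<String> rowKeys = new ArrayList<>();
        for (Map.Entry<String, String> e : postsByAuthor.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            rowKeys.add(e.getKey().substring(prefix.length()));
        }
        return rowKeys;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch db = new SecondaryIndexSketch();
        db.putPost("p1", "alice");
        db.putPost("p2", "bob");
        db.putPost("p3", "alice");
        System.out.println(db.rowKeysForAuthor("alice")); // [p1, p3]
    }
}
```

The index here is updated inline on each write; building it in bulk with a
map-reduce job, as mentioned in the quoted message below, is the other option.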

-ryan

On Thu, Feb 26, 2009 at 12:06 PM, Jonathan Gray <[email protected]> wrote:

> Bradford,
>
> Many of us probably have some input but it's really difficult to help
> without having more detail.
>
> Can you be more specific about the layout of the data and the queries you'd
> want to run?
>
> HBase is efficient at scanning (as with hdfs), but also efficient at
> randomly accessing by row key.  If you need to fetch based on column names
> or values, then hbase will not be efficient without some form of secondary
> indexing (additional tables in hbase or something external like lucene).
>
> JG
>
> > -----Original Message-----
> > From: Bradford Stephens [mailto:[email protected]]
> > Sent: Thursday, February 26, 2009 10:37 AM
> > To: [email protected]
> > Subject: Re: HBase and Web-Scale BI
> >
> > Yes, it seems that the fundamental 'differentness' of HDFS/MapReduce is
> > that they're not very well suited to random access -- I was hoping HBase
> > had found a way 'around' that, but of course that 'differentness' is a
> > fundamental strength of the HDFS way of doing things.
> >
> > Where things have gotten murky is that our data is very simple -- we just
> > have a lot of it. And we don't need to do a *lot* of random access to our
> > data -- it really doesn't feel like an RDBMS situation.
> >
> > Perhaps if we made an index out of a hash of each of our data values, and
> > did some 'normalization', that could be the key. Or maybe the metadata is
> > not going to be as large as I thought... hrm.
> >
> > I appreciate the input, and hope more people will chime in :)
> >
> > On Wed, Feb 25, 2009 at 10:18 PM, Ryan Rawson <[email protected]>
> > wrote:
> >
> > > Hey,
> > >
> > > You have to be clear about what hbase does and does not do.  HBase is
> > > just not a relational database - its "weakness" is its strength.
> > >
> > > In general, you can only access rows in key order.  Keys are stored
> > > lexicographically sorted however.  There aren't declarative secondary
> > > indexes (minus the lucene thing, but that isn't an index).  You have to
> > > put all these pieces together to build a system.
> > >
> > > But, you get scalability, and reasonable performance, and in 0.20 you
> > > will get really good performance (fast enough to serve websites
> > > hopefully).
> > >
> > > In general you need to make sure your row-key sorts data in the order
> > > you want to query by.  You can do something like this:
> > >
> > > <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
> > >
> > > to store events in reverse chronological order by users.
> > >
> > > If you want another access method, you need to use a map-reduce and
> > > build a secondary index.
> > >
> > > I don't know if this exactly answers your question, but hopefully it
> > > should give you more of an idea of what hbase does and does not do.
> > >
> > > -ryan
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Feb 25, 2009 at 9:02 PM, Bradford Stephens <
> > > [email protected]> wrote:
> > >
> > > > Greetings,
> > > >
> > > > I'm in charge of the data analysis and collection platform at my
> > > > company, and we're basing a large part of our core analysis platform
> > > > on Hadoop, Nutch, and Lucene -- it's a delight to use. However, we're
> > > > going to be wanting some on-demand "web-scale" business intelligence,
> > > > and I'm wondering if HBase is the right solution -- my research hasn't
> > > > given me any conclusions.
> > > >
> > > > Our data set is pretty simple -- a bunch of XML documents which have
> > > > been parsed from HTML pages, and some associated data (Author Name,
> > > > Post Date, Influence, etc). What we would like to be able to do is
> > > > have our end users do real-time (< 10 seconds) OLAP-type analysis on
> > > > this, and have it presented on a webpage. For example, queries like
> > > > ("All authors for the past two weeks who have used these keywords in
> > > > the post bodies and what their influence score is"). I imagine we'll
> > > > have several terabytes of data to go through, and we won't be able to
> > > > do much pre-population of results.
> > > >
> > > > Is HBase low-latency enough that we can scale-out to solve these sorts
> > > > of problems?
> > > >
> > > > Cheers,
> > > > Bradford
> > > >
> > >
>
>
