I may have misspoken somewhat - hbase is actually quite good at random reads. But the catch is, it can only randomly read via the row id. It's more or less akin to having a DB table with only a primary key index, and no secondary indexes.
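To make the analogy concrete, here is a minimal sketch in plain Java. It is a toy model, not HBase API: a `java.util.TreeMap` stands in for a table sorted by row key, and the class name `PrimaryKeyOnlyTable`, the `byAuthor` index, and the `'\0'` separator are all assumptions made up for illustration. It shows the one lookup you get for free (by row key), what a query by value costs without an index (a full scan), and what keeping a second "index table" in step by hand might look like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a TreeMap stands in for a table sorted by row key.
// None of these names are HBase API; they are hypothetical.
class PrimaryKeyOnlyTable {
    // Primary "table": row key -> author, kept in lexicographic key order.
    private final TreeMap<String, String> rows = new TreeMap<>();
    // Hand-maintained "index table": author + '\0' + row key -> row key.
    private final TreeMap<String, String> byAuthor = new TreeMap<>();

    // Every write has to update the index too; nothing does it for us.
    void put(String rowKey, String author) {
        String previous = rows.put(rowKey, author);
        if (previous != null) {
            byAuthor.remove(previous + '\0' + rowKey); // drop the stale entry
        }
        byAuthor.put(author + '\0' + rowKey, rowKey);
    }

    // Random read by row key: the only lookup supported directly.
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Without an index, a query by value has to visit every row.
    List<String> findByAuthorFullScan(String author) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.entrySet()) {
            if (e.getValue().equals(author)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // With the index table, the same query is a short range scan over keys.
    List<String> findByAuthorIndexed(String author) {
        String start = author + '\0';
        String stop = author + '\u0001'; // first key past all "author\0..." keys
        return new ArrayList<>(byAuthor.subMap(start, stop).values());
    }

    public static void main(String[] args) {
        PrimaryKeyOnlyTable table = new PrimaryKeyOnlyTable();
        table.put("post-001", "alice");
        table.put("post-002", "bob");
        table.put("post-003", "alice");
        System.out.println(table.get("post-002"));
        System.out.println(table.findByAuthorIndexed("alice"));
    }
}
```

The point of the sketch is the `put` method: the index is only as correct as the code that writes to it, which is the maintenance burden in question.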
So, yes, random reads and "index scans" work, and work well. You just have to handle the index creation and maintenance yourself.

-ryan

On Thu, Feb 26, 2009 at 12:06 PM, Jonathan Gray <[email protected]> wrote:

> Bradford,
>
> Many of us probably have some input but it's really difficult to help
> without having more detail.
>
> Can you be more specific about the layout of the data and the queries
> you'd want to run?
>
> HBase is efficient at scanning (as with hdfs), but also efficient at
> randomly accessing by row key. If you need to fetch based on column
> names or values, then hbase will not be efficient without some form of
> secondary indexing (additional tables in hbase or something external
> like lucene).
>
> JG
>
> > -----Original Message-----
> > From: Bradford Stephens [mailto:[email protected]]
> > Sent: Thursday, February 26, 2009 10:37 AM
> > To: [email protected]
> > Subject: Re: HBase and Web-Scale BI
> >
> > Yes, it seems that the fundamental 'differentness' of HDFS/MapReduce
> > is that they're not very well suited to random access -- I was
> > hoping HBase had found a way 'around' that, but of course that
> > 'differentness' is a fundamental strength of the HDFS way of doing
> > things.
> >
> > Where things have gotten murky is that our data is very simple -- we
> > just have a lot of it. And we don't need to do a *lot* of random
> > access to our data -- it really doesn't feel like an RDBMS
> > situation.
> >
> > Perhaps if we made an index out of a hash of each of our data
> > values, and did some 'normalization', that could be the key. Or
> > maybe the metadata is not going to be as large as I thought... hrm.
> >
> > I appreciate the input, and hope more people will chime in :)
> >
> > On Wed, Feb 25, 2009 at 10:18 PM, Ryan Rawson <[email protected]>
> > wrote:
> >
> > > Hey,
> > >
> > > You have to be clear about what hbase does and does not do. HBase
> > > is just not a relational database - its "weakness" is its
> > > strength.
> > >
> > > In general, you can only access rows in key order. Keys are stored
> > > lexicographically sorted however. There aren't declarative
> > > secondary indexes (minus the lucene thing, but that isn't an
> > > index). You have to put all these pieces together to build a
> > > system.
> > >
> > > But, you get scalability, and reasonable performance, and in 0.20
> > > you will get really good performance (fast enough to serve
> > > websites hopefully).
> > >
> > > In general you need to make sure your row-key sorts data in the
> > > order you want to query by. You can do something like this:
> > >
> > > <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
> > >
> > > to store events in reverse chronological order by users.
> > >
> > > If you want another access method, you need to use a map-reduce
> > > and build a secondary index.
> > >
> > > I don't know if this exactly answers your question, but hopefully
> > > should give you more of an idea of what hbase does and does not
> > > do.
> > >
> > > -ryan
> > >
> > > On Wed, Feb 25, 2009 at 9:02 PM, Bradford Stephens
> > > <[email protected]> wrote:
> > >
> > > > Greetings,
> > > >
> > > > I'm in charge of the data analysis and collection platform at my
> > > > company, and we're basing a large part of our core analysis
> > > > platform on Hadoop, Nutch, and Lucene -- it's a delight to use.
> > > > However, we're going to be wanting some on-demand "web-scale"
> > > > business intelligence, and I'm wondering if HBase is the right
> > > > solution -- my research hasn't given me any conclusions.
> > > >
> > > > Our data set is pretty simple -- a bunch of XML documents which
> > > > have been parsed from HTML pages, and some associated data
> > > > (Author Name, Post Date, Influence, etc). What we would like to
> > > > be able to do is have our end users do real-time (< 10 seconds)
> > > > OLAP-type analysis on this, and have it presented on a webpage.
> > > > For example, queries like ("All authors for the past two weeks
> > > > who have used these keywords in the post bodies and what their
> > > > influence score is"). I imagine we'll have several terabytes of
> > > > data to go through, and we won't be able to do much
> > > > pre-population of results.
> > > >
> > > > Is HBase low-latency enough that we can scale-out to solve these
> > > > sorts of problems?
> > > >
> > > > Cheers,
> > > > Bradford
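The row-key layout suggested earlier in the thread, `<user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>`, can be sketched roughly as below. This is an illustration under stated assumptions rather than HBase code: the thread does not say how the three parts are encoded, so the `/` separator, the zero-padded decimal form of the reversed timestamp, and the names `ReverseTimeRowKey`, `build`, and `scanUser` are all hypothetical. A `TreeMap` again stands in for the lexicographically sorted table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Sketch of a "<user> <Long.MAX_VALUE - timestamp> <event id>" row key.
// The string encoding and '/' separator are assumptions for illustration.
class ReverseTimeRowKey {
    // Zero-pad the reversed timestamp to 19 digits (the width of
    // Long.MAX_VALUE) so string order matches numeric order.
    // Assumes timestampMillis >= 0 and that user contains no '/'.
    static String build(String user, long timestampMillis, String eventId) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return String.format(Locale.ROOT, "%s/%019d/%s", user, reversed, eventId);
    }

    // Prefix scan over one user's rows; '0' is the character right after
    // '/', so "user0" is the first key past every "user/..." key.
    static List<String> scanUser(TreeMap<String, String> table, String user) {
        return new ArrayList<>(table.subMap(user + "/", user + "0").values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(build("bradford", 1000L, "e1"), "oldest");
        table.put(build("bradford", 3000L, "e3"), "newest");
        table.put(build("bradford", 2000L, "e2"), "middle");
        table.put(build("ryan", 2500L, "e9"), "other user");
        // prints [newest, middle, oldest]
        System.out.println(scanUser(table, "bradford"));
    }
}
```

Because a later timestamp yields a smaller reversed value, a plain forward scan over a user's key range returns the most recent events first, with no sorting at query time.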
