Yeah, it seems on the edge of feasibility to me. I'd much rather see a queue + MapReduce-style model.
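To make the queue + batch idea a bit more concrete, here is a rough, in-memory sketch in plain Java (no Hadoop or HBase dependencies). Queries are registered ahead of time, a periodic batch pass counts keywords for each one, and the web tier only ever reads the precomputed answer. Every class, field, and method name here is made up for illustration, and the HashMap merely stands in for the cache table Andy mentions below; a real job would of course run as MapReduce over the full data set.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PrecomputedTopKeywords {
    /** Hypothetical, simplified document; field names are assumptions. */
    public static final class Doc {
        final String author;
        final LocalDate postDate;
        final List<String> keywords;

        public Doc(String author, LocalDate postDate, List<String> keywords) {
            this.author = author;
            this.postDate = postDate;
            this.keywords = keywords;
        }
    }

    /** Hypothetical pre-registered query: top N keywords for some authors in a date range. */
    public static final class Query {
        final String id;
        final Set<String> authors;
        final LocalDate from;
        final LocalDate to;
        final int topN;

        public Query(String id, Set<String> authors, LocalDate from, LocalDate to, int topN) {
            this.id = id;
            this.authors = authors;
            this.from = from;
            this.to = to;
            this.topN = topN;
        }
    }

    // Stand-in for the results cache (e.g. a second table keyed by query id).
    private final Map<String, List<String>> cache = new HashMap<>();

    /** Batch step: answer every queued query and store the result in the cache. */
    public void runBatch(List<Query> queue, List<Doc> docs) {
        for (Query q : queue) {
            Map<String, Long> counts = new HashMap<>();
            for (Doc d : docs) {
                boolean inRange = !d.postDate.isBefore(q.from) && !d.postDate.isAfter(q.to);
                if (!inRange || !q.authors.contains(d.author)) {
                    continue;
                }
                for (String keyword : d.keywords) {
                    counts.merge(keyword, 1L, Long::sum); // "reduce": sum per keyword
                }
            }
            // Sort by count descending, then alphabetically so ties are deterministic.
            List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
            entries.sort((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            List<String> top = new ArrayList<>();
            for (Map.Entry<String, Long> e : entries.subList(0, Math.min(q.topN, entries.size()))) {
                top.add(e.getKey());
            }
            cache.put(q.id, top);
        }
    }

    /** Serving step: a cheap lookup; empty if the query has not been computed yet. */
    public List<String> serve(String queryId) {
        List<String> hit = cache.get(queryId);
        return hit == null ? Collections.<String>emptyList() : hit;
    }
}
```

The point of the split is that `serve` never touches the raw documents, so its latency is independent of data volume; the cost is that answers are only as fresh as the last batch run.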
On Fri, Feb 27, 2009 at 8:28 AM, Andrew Purtell <[email protected]> wrote:
> I have done something like this for a different domain but with
> similar scale and user demands. The analysts by necessity needed
> to specify their queries in advance and we periodically ran
> mapreduce jobs to materialize into a cache (another HBase table)
> new results as new fresh data arrived. Serving answers out of
> cache of course then was very fast. Because we were precomputing
> answers the analysts needed to apply some forethought and
> discipline, query capacity had to be rationed, workflows had to
> tolerate slightly out of date information, and despite all of
> these necessary "drawbacks" the system was quite successful.
>
> Without precomputing the answer to such queries I don't see how
> one can present an assembly of such information sourced from TB
> (or PB) of data in less than 10 seconds. The essential strategy
> here is shifting computation in time and trading cheap disk for
> probably impossible CPU and index I/O demands for would-be
> real-time queries.
>
> Maybe someone else can speak up if they think I am being too
> pessimistic here.
>
> Hope this helps,
>
> - Andy
>
> > From: Bradford Stephens <[email protected]>
> > Subject: Re: HBase and Web-Scale BI
> > To: [email protected]
> > Date: Thursday, February 26, 2009, 4:05 PM
> >
> > Sure, here we go! I'm not at all opposed to indexing tables, etc.
> > I just want this thing to be fast and non-kludgy.
> >
> > Basically, we're getting social media (like Blogs), normalizing
> > the data into fields, and then doing BI on that.
> >
> > Our data is pretty simple ... here's an example:
> >
> > Document:
> >
> > BodyText (string)
> > BodyText Keywords (Lucene indexed)
> > URL (indexed, key with collection time?)
> > ParentDocumentID
> > Post Date (datetime)
> > Author Name (indexed, string)
> > Post Topic (string)
> > BodyLinks (list of URLs, possibly indexed?)
> >
> > An example query our user would build in the web interface might
> > be, "What are the top 15 keywords for all documents from Feb 1st -
> > April 10th where the author is one of these five people". We would
> > need to aggregate this data and have it presented in no more than
> > 10 seconds.
> >
> > We're expecting dozens of TB of data, perhaps more...
> >
> > On Thu, Feb 26, 2009 at 1:52 PM, Ryan Rawson <[email protected]> wrote:
> >
> > > I may have misspoken somewhat - hbase is actually quite good at
> > > random reads. But the catch is, it can only randomly read via
> > > the row id. It's more or less akin to having a DB table with
> > > only an indexed primary key, and no secondary indexes.
> > >
> > > So, yes, random reads and "index scans" work, and work well. You
> > > just have to handle the index creation and maintenance yourself.
> > >
> > > -ryan
> > >
> > > On Thu, Feb 26, 2009 at 12:06 PM, Jonathan Gray <[email protected]> wrote:
> > >
> > > > Bradford,
> > > >
> > > > Many of us probably have some input but it's really difficult
> > > > to help without having more detail.
> > > >
> > > > Can you be more specific about the layout of the data and the
> > > > queries you'd want to run?
> > > >
> > > > HBase is efficient at scanning (as with hdfs), but also
> > > > efficient at randomly accessing by row key. If you need to
> > > > fetch based on column names or values, then hbase will not be
> > > > efficient without some form of secondary indexing (additional
> > > > tables in hbase or something external like lucene).
> > > >
> > > > JG
> > > >
> > > > > -----Original Message-----
> > > > > From: Bradford Stephens [mailto:[email protected]]
> > > > > Sent: Thursday, February 26, 2009 10:37 AM
> > > > > To: [email protected]
> > > > > Subject: Re: HBase and Web-Scale BI
> > > > >
> > > > > Yes, it seems that the fundamental 'differentness' of
> > > > > HDFS/MapReduce is that they're not very well suited to
> > > > > random access -- I was hoping HBase had found a way 'around'
> > > > > that, but of course that 'differentness' is a fundamental
> > > > > strength of the HDFS way of doing things.
> > > > >
> > > > > Where things have gotten murky is that our data is very
> > > > > simple -- we just have a lot of it. And we don't need to do
> > > > > a *lot* of random access to our data -- it really doesn't
> > > > > feel like an RDBMS situation.
> > > > >
> > > > > Perhaps if we made an index out of a hash of each of our
> > > > > data values, and did some 'normalization', that could be the
> > > > > key. Or maybe the metadata is not going to be as large as I
> > > > > thought... hrm.
> > > > >
> > > > > I appreciate the input, and hope more people will chime in :)
> > > > >
> > > > > On Wed, Feb 25, 2009 at 10:18 PM, Ryan Rawson <[email protected]> wrote:
> > > > >
> > > > > > Hey,
> > > > > >
> > > > > > You have to be clear about what hbase does and does not
> > > > > > do. HBase is just not a relational database - its
> > > > > > "weakness" is its strength.
> > > > > >
> > > > > > In general, you can only access rows in key order. Keys
> > > > > > are stored lexicographically sorted however. There aren't
> > > > > > declarative secondary indexes (minus the lucene thing, but
> > > > > > that isn't an index). You have to put all these pieces
> > > > > > together to build a system.
> > > > > >
> > > > > > But, you get scalability, and reasonable performance, and
> > > > > > in 0.20 you will get really good performance (fast enough
> > > > > > to serve websites hopefully).
> > > > > >
> > > > > > In general you need to make sure your row-key sorts data
> > > > > > in the order you want to query by. You can do something
> > > > > > like this:
> > > > > >
> > > > > > <user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>
> > > > > >
> > > > > > to store events in reverse chronological order by users.
> > > > > >
> > > > > > If you want another access method, you need to use a
> > > > > > map-reduce and build a secondary index.
> > > > > >
> > > > > > I don't know if this exactly answers your question, but
> > > > > > hopefully should give you more of an idea of what hbase
> > > > > > does and does not do.
> > > > > >
> > > > > > -ryan
> > > > > >
> > > > > > On Wed, Feb 25, 2009 at 9:02 PM, Bradford Stephens <[email protected]> wrote:
> > > > > >
> > > > > > > Greetings,
> > > > > > >
> > > > > > > I'm in charge of the data analysis and collection
> > > > > > > platform at my company, and we're basing a large part of
> > > > > > > our core analysis platform on Hadoop, Nutch, and Lucene
> > > > > > > -- it's a delight to use. However, we're going to be
> > > > > > > wanting some on-demand "web-scale" business
> > > > > > > intelligence, and I'm wondering if HBase is the right
> > > > > > > solution -- my research hasn't given me any conclusions.
> > > > > > >
> > > > > > > Our data set is pretty simple -- a bunch of XML
> > > > > > > documents which have been parsed from HTML pages, and
> > > > > > > some associated data (Author Name, Post Date, Influence,
> > > > > > > etc).
> > > > > > > What we would like to be able to do is have our end
> > > > > > > users do real-time (< 10 seconds) OLAP-type analysis on
> > > > > > > this, and have it presented on a webpage. For example,
> > > > > > > queries like ("All authors for the past two weeks who
> > > > > > > have used these keywords in the post bodies and what
> > > > > > > their influence score is"). I imagine we'll have several
> > > > > > > terabytes of data to go through, and we won't be able to
> > > > > > > do much pre-population of results.
> > > > > > >
> > > > > > > Is HBase low-latency enough that we can scale-out to
> > > > > > > solve these sorts of problems?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Bradford
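
For reference, the reverse-chronological row key Ryan describes in the quoted thread (<user> <Long.MAX_VALUE - System.currentTimeMillis()> <event id>) could be built roughly as follows. This is only a sketch under a few assumptions that are not in the thread: the key is a plain string, the '|' separator never appears in user ids, timestamps are non-negative, and the reversed timestamp is zero-padded to 19 digits so that lexicographic order matches numeric order. The class and method names are hypothetical.

```java
class EventRowKey {
    // Assumed separator; user ids are assumed never to contain it.
    private static final char SEP = '|';

    /** Builds "<user>|<Long.MAX_VALUE - timestamp, zero-padded>|<eventId>". */
    public static String build(String user, long timestampMillis, String eventId) {
        if (timestampMillis < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        // Subtracting from Long.MAX_VALUE makes newer events produce smaller numbers.
        long reversed = Long.MAX_VALUE - timestampMillis;
        // Long.MAX_VALUE has 19 decimal digits; padding keeps string order == numeric order.
        return user + SEP + String.format("%019d", reversed) + SEP + eventId;
    }

    public static void main(String[] args) {
        String older = build("alice", 1000L, "e1");
        String newer = build("alice", 2000L, "e2");
        // A negative result means the newer event sorts before the older one.
        System.out.println(newer.compareTo(older));
    }
}
```

Because keys are stored sorted, a scan starting at the prefix for one user would then return that user's most recent events first.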
