Hey,

There have been huge improvements in latency since 0.19 in the current trunk/0.20-dev work. There are plans afoot to hopefully have a release by June.
I am seeing reasonable latency numbers from HBase, but my case is very pathological: small cells, and many of them (1k+). Given the performance characteristics and what was fixed in 0.20, I hope the new stuff will be fast enough for your needs. Furthermore, we have LZO compression working reliably in 0.20; this will help greatly with storing this kind of data, with not much overhead at read time (a rough sketch of turning it on is below the quoted message). Generally though, this is the kind of area where HBase should shine.

If you are willing to put up with some risk, you can try the latest svn trunk. There are still big pending changes, so don't count on your /hbase directory carrying through to the actual 0.20 release, but give it a shot.

Thanks,
-ryan

On Mon, May 11, 2009 at 3:56 PM, dotnetmetal <[email protected]> wrote:
> Hey there,
>
> I've been working with Hadoop for about a year now, and have recently been
> tasked with our new metadata storage and analysis platform. I'm looking for
> your advice on what I should research, and whether HBase is right for our
> use cases.
>
> Currently, we're collecting documents onto our Hadoop cluster, and then
> indexing them with Lucene (and Katta). Documents have attributes like a
> create date, author, bodytext, domain, etc.
>
> We're looking at 20TB of data to start with, growing by a few dozen a day.
>
> I'm researching the best way to provide BI on top of this data that our
> customers can "Slice and Dice" on. HBase has some appealing
> characteristics, but I'm not sure if it's *quite* what we need, since
> latency is an issue. Lucene has great indexing, but we're also going to be
> adding metadata constantly and performing schema changes.
>
> Here's a use case:
>
> A customer searches for a keyword in our web UI and a list of a few
> hundred thousand documents is returned. The customer would then like to
> select a few random authors from those documents for a certain date range
> (let's say 4 months), and get a count of documents per author. A few hours
> later, these documents are tagged with some more metadata... say, PageRank
> of the parent domain. The user can use this data as part of his queries as
> well. We'd like to have a response time of 10 seconds or so.
>
> I don't care much about storage space, so denormalization is totally fine.
> Is this a problem we can tackle in HBase or another open source
> distributed DB?
>
> A company called "Vertica" claims to be able to do this, but I wasn't very
> impressed with their architecture. "Greenplum" also looks interesting, but
> I haven't researched them much yet.
>
> Thanks for all your help!
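Here is the sketch referenced above: creating a table with an LZO-compressed family through the 0.20-era Java client. Treat it as a rough illustration, not a recipe: the "documents"/"meta" table and family names are made up, the trunk API may still shift before the release, and the LZO native libraries have to be installed on every regionserver for this to work.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.io.hfile.Compression;

  public class CreateLzoTable {
    public static void main(String[] args) throws Exception {
      // Hypothetical layout: one row per document, metadata cells in one family.
      HBaseConfiguration conf = new HBaseConfiguration();
      HBaseAdmin admin = new HBaseAdmin(conf);

      HTableDescriptor table = new HTableDescriptor("documents");
      HColumnDescriptor meta = new HColumnDescriptor("meta");
      // Store this family LZO-compressed on disk; blocks are decompressed on
      // read, which is cheap relative to the IO saved on lots of small cells.
      meta.setCompressionType(Compression.Algorithm.LZO);
      table.addFamily(meta);

      admin.createTable(table);
    }
  }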
