I'd say that the current state of Hbase is better suited to offline processing than to online serving duties, but I do envision that the roadmap for Hbase could extend to cover those capabilities. Currently, however, Michael and Jim are spending most of their time stabilizing the core of the system and working on basic performance bottlenecks, especially as several large-scale Hbase installations are starting to pop up and file issues.
Here are some of the things that I think would move Hbase in the right direction for online serving:

1. Atomic appends for a single writer (HADOOP-1700): We have to have atomic appends for the commit log or durability is not guaranteed. This is a pressing issue in any case for any offline processing use case that requires a 100% guarantee on durability.

2. Real-time master failover: Need to make sure there is zero downtime on failure of the HDFS master and the Hbase master. Perhaps the Zookeeper project will provide the key part of the solution although I don't have much visibility into where Zookeeper stands and what its roadmap looks like. Can anyone say anything more?

3. More performance work: Michael did some performance measurements a while back that seemed to indicate a lot of time spent back-and-forth in RPC. We're exploring Thrift as a lighter-weight RPC mechanism, but there are probably other things to be done to reduce this cost. More analysis and measurement would be helpful.

4. Tighter integration between HDFS and Hbase: Preference for running the region server on the same node as one of the replicas of the underlying tables would lower latency.

5. Memory caching: Instead of pinning a whole Hbase table in RAM, I'd recommend the use of memcached in front of Hbase to provide cached read access.

Once these things are in place, Hbase could provide a reasonably performant large-scale online serving system. The main advantages of such a system would be its flexible schema, automatic repartitioning, and centralized administration, especially when compared with a system based around many separate MySQL instances with memcached in front of them. It would not have full ACID properties but there are many interesting applications that don't require strong guarantees in those areas.

Anyone who'd like to start tackling any of the above items should feel free to chime in here or jump on the Hbase IRC - more contributors always welcome!
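To make the memory caching suggestion (item 5) a bit more concrete, here is a rough sketch of the usual cache-aside read path: check the cache first, fall back to the table on a miss, and populate the cache with what was found. Everything in it is an assumption for illustration. `TTLCache` and `RowStore` are hypothetical in-memory stand-ins for memcached and an Hbase table, and `read_row` / `write_row` are made-up helper names; none of this is a real memcached or Hbase client API.

```python
# Cache-aside read path: a sketch only. Both stores are in-memory stand-ins;
# no real memcached or Hbase client API is used or implied.
import time


class TTLCache:
    """Hypothetical stand-in for memcached: key -> (value, expiry time)."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._data[key]
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)

    def delete(self, key):
        self._data.pop(key, None)


class RowStore:
    """Hypothetical stand-in for an Hbase table; counts reads for illustration."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0

    def get_row(self, key):
        self.reads += 1
        return self._rows.get(key)

    def put_row(self, key, row):
        self._rows[key] = row


def read_row(cache, store, key):
    """Serve from the cache when possible; otherwise read the store and cache it."""
    row = cache.get(key)
    if row is None:
        row = store.get_row(key)
        if row is not None:
            cache.set(key, row)
    return row


def write_row(cache, store, key, row):
    """Write to the store, then invalidate the cached copy so readers refetch."""
    store.put_row(key, row)
    cache.delete(key)
```

With this shape, repeated page loads for the same row hit the backing table only once per TTL window, and a write simply drops the cached copy. Rows that do not exist are not cached here, so repeated lookups of missing keys still reach the table; whether that matters depends on the access pattern.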

Chad Walters
Search Architect
Powerset

> Date: Fri, 30 Nov 2007 09:50:19 -0800
> Subject: Re: Hbase for dynamic web site?
> From: [EMAIL PROTECTED]
> To: [email protected]
>
> Are you already using memcache and related approaches?
>
> On 11/30/07 9:46 AM, "Mike Perkowitz" wrote:
>
>> Hello! We have a web site currently built on linux/apache/mysql/php. Most
>> pages do some mysql queries and then stuff the results into php/html
>> templates. We've been hitting the limits of what our database can handle,
>> and what we can do in realtime for the site. Our plan is to move our data
>> over to Hbase, precomputing as much as we can (some queries we currently do
>> with joins in mysql, for example). Our pages would then be pulling rows from
>> Hbase to stuff into templates.
>>
>> We're still working on getting Hbase working with the amount of data we want
>> to be able to handle, so haven't yet been able to test it for performance.
>> Is anyone else using Hbase in this way, and what has been your experience
>> with realtime performance? I haven't really seen examples of people using
>> Hbase this way - another approach would be for us to use
>> Hadoop/Hbase/mapreduce for computation then put results back into mysql or
>> whatever for realtime access. Any experience or suggestions would be
>> appreciated!
>>
>> Thanks,
>>
>> Mike
