Ok, I'm going to ramble about HBase, Bigtable, and storage engineering. Sorry in advance.
IIRC, Google aims for ~100 tablets per tablet server, and I believe each tablet is also kept to around 200 MB. (These details are several years old, so they may have changed.) Keeping tablets small and few per server minimizes churn and region unavailability when a node fails, and lets tables split rapidly for parallelism. Keeping per-disk storage requirements down at the GFS layer also minimizes the impact of a failed disk. With large RAM per node and reasonable region counts and sizes, many tables can be cached and served entirely out of RAM. Bigtable is big because there are 100s if not 1000s of nodes participating. The performance numbers in the Bigtable paper are impressive because of all of the above, and it is cheap (for Google) because they build their own hardware and buy components in bulk.

HBase operates in a different world for sure. Many evaluators or new users expect a lot more for a lot less. They don't build their own hardware -- but could and maybe should -- and don't make bulk purchases. Rather, test deployments of 4 or 10 nodes are common, and often the hardware is underpowered for the attempted load. Sometimes even smaller deployments are considered, or $deity forbid VMs are used, but those make no sense except as programmer tools.

Consider, for example, HDFS+HBase+MapReduce nodes each with dual quad core CPUs, 8-32 GB of RAM, and 4 x 750 GB data drives, for effective data storage after 3-way replication of ~1 TB per node. A Dell PowerEdge 2950 with appropriate options is an example of the type of hardware that might be racked for this purpose. That storage density is ok for HDFS, but it is a lot for HBase: if all of that storage is dedicated to HBase, there is an expectation that HBase can fill it. Set the region split threshold for large data tables to 1 GB each and that is still on the order of 1000 regions per node -- potentially many more if split points for small data tables are set lower to allow sufficient splitting for I/O parallelism. That represents a lot of region churn if a node goes down. For the general case this should be more on the order of 200 regions per node I think, maybe 250. It's a judgment call that depends on your requirements for region availability recovery time after a fault. HBase 0.20 is much better than 0.19 here, and 0.21 will be another significant improvement in recovery time, so this will become less of an issue over time. You still have to consider how many regions a region server can handle given the available RAM (store index sizes, etc.).

Maybe you could set the region split threshold to 4 GB and get 1 TB of large-table storage in only 250 regions, but I don't think anyone has ever tried that, or should: compaction times will be killer, and a lot of RAM will be necessary to buffer writes during them. Or, reduce the per-node storage: 1 GB per region x 200 regions x 3-way replication = 600 GB, and 4 x 160 GB data drives can store 640 GB. This also brings the caching and data processing capacity of these nodes more in line with the data stored on them, and the loss of 160 GB worth of block replicas is less impactful than the loss of 1 TB of block replicas for any individual drive failure.
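To make that split-threshold knob concrete, here is roughly what a per-table ~1 GB threshold could look like with the 0.20-era Java client. This is a sketch, not a recommendation: the table and family names ("bigdata", "d") are made up, and it assumes HTableDescriptor.setMaxFileSize() is available rather than changing the cluster-wide hbase.hregion.max.filesize.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CreateBigTable {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      HBaseAdmin admin = new HBaseAdmin(conf);

      HTableDescriptor desc = new HTableDescriptor("bigdata");
      // split regions at roughly 1 GB for this table only, instead of
      // relying on the global hbase.hregion.max.filesize default
      desc.setMaxFileSize(1024L * 1024 * 1024);
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc);
    }
  }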
Cassandra or some other system with P2P/consistent hashing/Dynamo-type storage is no panacea here: as storage density goes up, the effect of node and disk loss -- and therefore replica loss -- is just as significant, if you care about consistency and are likewise trying to do more with less.

As an alternative, continuing with the PowerEdge example, you might consider filling its 6 drive bays (the max configuration for 3.5" drives) with 1 TB data drives, and employing PXE booting and an NFS/ramdisk root to get a functioning system, for really high storage density (under $1/GB raw, I think). But I would expect most of that data to live in HDFS, not HBase, and most of it to be archival in nature, because that is a lot to process with only 4 available cores per node -- the other 4 cores being busy with HDFS, HBase, the TaskTracker, and other system daemons. Going further, you could construct DataNodes out of Backblaze pods (http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/) and rapidly build out petabytes of HDFS, in effect a cheap SAN. But then you have moved computation entirely away from the data and must either accept the resulting order-of-magnitude reduction in I/O throughput or pay for a big 10GE switch fabric, which is expensive.

So what amount of archival storage do you actually need? What is the effective useful lifespan of the collected data? Timeseries data in particular is often not useful beyond some period of time. Why not set TTLs and let HBase garbage collect expired data at (major) compaction time? How much storage is actually needed then?
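A minimal sketch of the TTL idea, assuming a 30 day retention window and the HColumnDescriptor.setTimeToLive(seconds) API; the table and family names are made up. Cells older than the TTL get dropped as compactions rewrite the store files.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CreateTimeseriesTable {
    public static void main(String[] args) throws Exception {
      HColumnDescriptor family = new HColumnDescriptor("d");
      // keep 30 days of data; older cells are discarded at (major) compaction
      family.setTimeToLive(30 * 24 * 60 * 60);

      HTableDescriptor desc = new HTableDescriptor("timeseries");
      desc.addFamily(family);
      new HBaseAdmin(new HBaseConfiguration()).createTable(desc);
    }
  }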
- Andy

________________________________
From: Jonathan Gray <[email protected]>
To: [email protected]
Sent: Thursday, September 3, 2009 11:04:20 AM
Subject: Re: Cassandra vs HBase

There are lots of interesting ways you can design your tables and keys to avoid the single-regionserver hotspotting. A long time ago I did an experimental design that pre-pended a random value to every row key, where the value was modulo'd by the number of regionservers (or somewhere between 1/2 and 2 * #ofRS), so for a given stamp there would be that many potential regions it could go into. This design doesn't make time-range MR jobs very efficient, though, because a single range is spread out across the entire table... but I'm not sure you can avoid that if you want good distribution; those two requirements are at odds.

You say 2TB of data a day on 10-20 nodes? What kinds of nodes are you expecting to use? In a month, that's 60TB of data, so 3-6TB per node? And that's pre-replication, so you're talking 9-18TB per node? And you want full random access to that, while running batch MR jobs, while continuously importing more? That's a tall order. You'd be adding >1000 regions a day... and on 10-20 nodes?

Do you really need full random access to the entire raw dataset? Could you load into HDFS, run batch jobs against HDFS, but also have some jobs that take HDFS data, run some aggregations/filters/etc, and then put _that_ data into HBase?

You also say you're going to delete data. What's the time window you want to keep?

HBase is capable of handling lots of stuff, but you seem to want to process very large datasets (and the trifecta: heavy writes, batch/scanning reads, and random reads) on a very small cluster. 10 nodes is really the bare minimum for any production system serving any reasonably sized dataset (>1TB), unless the individual nodes are very powerful.

JG

On Thu, September 3, 2009 12:15 am, stack wrote:
> On Wed, Sep 2, 2009 at 11:37 PM, Schubert Zhang <[email protected]> wrote:
>
>>> Do you need to keep it all? Does some data expire (or can it be
>>> moved offline)?
>>
>> Yes, we need to remove old data which expires.
>
> When does data expire? Or, how many billions of rows should your cluster
> of 10-20 nodes carry at a time?
>
>> The data will arrive with a minutes delay. Usually, we need to
>> write/ingest tens of thousands of new rows, many rows with the same
>> timestamp.
>
> Will the many rows of same timestamp all go into the one timestamp row, or
> will the key have a further qualifier such as event type to distinguish
> amongst the updates that arrive at the same timestamp?
>
> What do you see as the approximate write rate, and what do you think its
> spread across timestamps will be? E.g. 10000 updates a second and all of
> the updates fit within a ten second window?
>
> Sorry for all the questions.
>
> St.Ack
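For reference, a rough sketch of the salted-key design JG describes above; the bucket count, key layout, and names here are illustration-only assumptions, not his exact scheme.

  import java.util.Random;

  public class SaltedKeys {
    // Assumption: bucket count somewhere between 0.5x and 2x the number of
    // regionservers, per JG's description. 16 is just an example.
    static final int BUCKETS = 16;
    static final Random RANDOM = new Random();

    // Prepend a random bucket prefix so writes for the same timestamp spread
    // across up to BUCKETS regions instead of hammering one regionserver.
    // The timestamp is zero-padded so lexicographic order matches time order.
    static byte[] saltedKey(long timestamp) {
      int bucket = RANDOM.nextInt(BUCKETS);
      return String.format("%02d-%019d", bucket, timestamp).getBytes();
    }

    public static void main(String[] args) {
      // prints something like "07-0001251993860000"; the cost of this layout
      // is that a time-range scan must be issued once per bucket prefix,
      // e.g. [00-<start>, 00-<stop>), [01-<start>, 01-<stop>), ...
      System.out.println(new String(saltedKey(System.currentTimeMillis())));
    }
  }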
