Ok, I'm going to ramble about HBase, Bigtable, and storage engineering. Sorry in advance.
IIRC, Google aims for ~100 tablets per tablet server, and I believe each tablet is also kept to around 200 MB. (These details are several years old, so they may have changed.) Keeping tablets small and few per server minimizes churn and region unavailability when a node fails, and lets tables split rapidly for parallelism. Keeping per-disk storage requirements down at the GFS layer also minimizes the impact of a failed disk. With large RAM per node and reasonable region counts and sizes, many tables can be cached and served entirely out of RAM. Bigtable is big because there are 100s if not 1000s of nodes participating. The performance numbers in the Bigtable paper are impressive because of all of the above, and it is cheap (for Google) because they build their own hardware and buy components in bulk.

HBase operates in a different world for sure. Many evaluators or new users expect a lot more for a lot less. They don't build their own hardware -- but could and maybe should -- and don't make bulk purchases. Rather, test deployments of 4 or 10 nodes are common, and often the hardware is underpowered for the attempted load. Sometimes even smaller deployments are considered, or $deity forbid VMs are used, but those make no sense except as programmer tools.

Consider, for example, HDFS+HBase+MapReduce nodes each with dual quad core CPUs, 8-32 GB of RAM, and 4 x 750 GB data drives, for effective data storage after 3-way replication of ~1 TB per node. A Dell PowerEdge 2950 with appropriate options is an example of the type of hardware that might be racked for this purpose. That storage density is ok for HDFS, but it is a lot for HBase: if all of that storage is dedicated to HBase, there is an expectation that HBase can fill it. Set the region split threshold for large data tables to 1 GB each and that is still on the order of 1000 regions per node -- potentially many more if split points for small data tables are set lower to allow sufficient splitting for I/O parallelism. That represents a lot of region churn if a node goes down. For the general case this should be more on the order of 200 regions per node I think, maybe 250. It's a judgment call that depends on your requirements for region availability recovery time after a fault. HBase 0.20 is much better than 0.19 here, and 0.21 will be another significant improvement in recovery time, so this will become less of an issue over time. You still have to consider how many regions a region server can handle given the available RAM (store index sizes, etc.).

Maybe you could set the region split threshold to 4 GB and get 1 TB of large-table storage in only 250 regions, but I don't think anyone has ever tried that, or should: compaction times will be killer, and a lot of RAM will be necessary to buffer writes during them. Or, reduce the per-node storage: 1 GB per region x 200 regions x 3-way replication = 600 GB, and 4 x 160 GB data drives can store 640 GB. This also brings the caching and data processing capacity of these nodes more in line with the data stored on them, and the loss of 160 GB worth of block replicas is less impactful than the loss of 1 TB of block replicas for any individual drive failure.
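To make that split-threshold knob concrete, here is roughly what a per-table ~1 GB threshold could look like with the 0.20-era Java client. This is a sketch, not a recommendation: the table and family names ("bigdata", "d") are made up, and it assumes HTableDescriptor.setMaxFileSize() is available rather than changing the cluster-wide hbase.hregion.max.filesize.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CreateBigTable {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      HBaseAdmin admin = new HBaseAdmin(conf);

      HTableDescriptor desc = new HTableDescriptor("bigdata");
      // split regions at roughly 1 GB for this table only, instead of
      // relying on the global hbase.hregion.max.filesize default
      desc.setMaxFileSize(1024L * 1024 * 1024);
      desc.addFamily(new HColumnDescriptor("d"));
      admin.createTable(desc);
    }
  }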
Cassandra or some other system with P2P/consistent hashing/Dynamo-type storage is no panacea here: as storage density goes up, the effect of node and disk loss -- and therefore replica loss -- is just as significant, if you care about consistency and are likewise trying to do more with less.

As an alternative, continuing with the PowerEdge example, you might consider filling its 6 drive bays (the max configuration for 3.5" drives) with 1 TB data drives, and employing PXE booting and an NFS/ramdisk root to get a functioning system, for really high storage density (under $1/GB raw, I think). But I would expect most of that data to live in HDFS, not HBase, and most of it to be archival in nature, because that is a lot to process with only 4 available cores per node -- the other 4 cores being busy with HDFS, HBase, the TaskTracker, and other system daemons. Going further, you could construct DataNodes out of Backblaze pods (http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/) and rapidly build out petabytes of HDFS, in effect a cheap SAN. But then you have moved computation entirely away from the data and must either accept the resulting order-of-magnitude reduction in I/O throughput or pay for a big 10GE switch fabric, which is expensive.

So what amount of archival storage do you actually need? What is the effective useful lifespan of the collected data? Timeseries data in particular is often not useful beyond some period of time. Why not set TTLs and let HBase garbage collect expired data at (major) compaction time? How much storage is actually needed then?
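A minimal sketch of the TTL idea, assuming a 30 day retention window and the HColumnDescriptor.setTimeToLive(seconds) API; the table and family names are made up. Cells older than the TTL get dropped as compactions rewrite the store files.

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class CreateTimeseriesTable {
    public static void main(String[] args) throws Exception {
      HColumnDescriptor family = new HColumnDescriptor("d");
      // keep 30 days of data; older cells are discarded at (major) compaction
      family.setTimeToLive(30 * 24 * 60 * 60);

      HTableDescriptor desc = new HTableDescriptor("timeseries");
      desc.addFamily(family);
      new HBaseAdmin(new HBaseConfiguration()).createTable(desc);
    }
  }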
- Andy

________________________________
From: Jonathan Gray <[email protected]>
To: [email protected]
Sent: Thursday, September 3, 2009 11:04:20 AM
Subject: Re: Cassandra vs HBase

There are lots of interesting ways you can design your tables and keys to avoid the single-regionserver hotspotting. A long time ago I did an experimental design that pre-pended a random value to every row key, where the value was modulo'd by the number of regionservers (or somewhere between 1/2 and 2 * #ofRS), so for a given stamp there would be that many potential regions it could go into. This design doesn't make time-range MR jobs very efficient, though, because a single range is spread out across the entire table... but I'm not sure you can avoid that if you want good distribution; those two requirements are at odds.

You say 2TB of data a day on 10-20 nodes? What kinds of nodes are you expecting to use? In a month, that's 60TB of data, so 3-6TB per node? And that's pre-replication, so you're talking 9-18TB per node? And you want full random access to that, while running batch MR jobs, while continuously importing more? That's a tall order. You'd be adding >1000 regions a day... and on 10-20 nodes?

Do you really need full random access to the entire raw dataset? Could you load into HDFS, run batch jobs against HDFS, but also have some jobs that take HDFS data, run some aggregations/filters/etc, and then put _that_ data into HBase?

You also say you're going to delete data. What's the time window you want to keep?

HBase is capable of handling lots of stuff, but you seem to want to process very large datasets (and the trifecta: heavy writes, batch/scanning reads, and random reads) on a very small cluster. 10 nodes is really the bare minimum for any production system serving any reasonably sized dataset (>1TB), unless the individual nodes are very powerful.

JG

On Thu, September 3, 2009 12:15 am, stack wrote:
> On Wed, Sep 2, 2009 at 11:37 PM, Schubert Zhang <[email protected]> wrote:
>
>>> Do you need to keep it all? Does some data expire (or can it be
>>> moved offline)?
>>
>> Yes, we need to remove old data which expires.
>
> When does data expire? Or, how many billions of rows should your cluster
> of 10-20 nodes carry at a time?
>
>> The data will arrive with a minutes delay. Usually, we need to
>> write/ingest tens of thousands of new rows, many rows with the same
>> timestamp.
>
> Will the many rows of same timestamp all go into the one timestamp row, or
> will the key have a further qualifier such as event type to distinguish
> amongst the updates that arrive at the same timestamp?
>
> What do you see as the approximate write rate, and what do you think its
> spread across timestamps will be? E.g. 10000 updates a second and all of
> the updates fit within a ten second window?
>
> Sorry for all the questions.
>
> St.Ack
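For reference, a rough sketch of the salted-key design JG describes above; the bucket count, key layout, and names here are illustration-only assumptions, not his exact scheme.

  import java.util.Random;

  public class SaltedKeys {
    // Assumption: bucket count somewhere between 0.5x and 2x the number of
    // regionservers, per JG's description. 16 is just an example.
    static final int BUCKETS = 16;
    static final Random RANDOM = new Random();

    // Prepend a random bucket prefix so writes for the same timestamp spread
    // across up to BUCKETS regions instead of hammering one regionserver.
    // The timestamp is zero-padded so lexicographic order matches time order.
    static byte[] saltedKey(long timestamp) {
      int bucket = RANDOM.nextInt(BUCKETS);
      return String.format("%02d-%019d", bucket, timestamp).getBytes();
    }

    public static void main(String[] args) {
      // prints something like "07-0001251993860000"; the cost of this layout
      // is that a time-range scan must be issued once per bucket prefix,
      // e.g. [00-<start>, 00-<stop>), [01-<start>, 01-<stop>), ...
      System.out.println(new String(saltedKey(System.currentTimeMillis())));
    }
  }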
