Alex,

So each row = 24 column-families(?) * 300,000,000 entries/family * ~40
bytes/entry = about 270GB/row ?

And that * 100,000 rows = about 27 petabytes of data?

Is my math right here? :)
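For what it's worth, a quick sanity check of that arithmetic (these are just the numbers from your message, nothing HBase-specific):

    // Back-of-the-envelope check of the figures above.
    public class SizeEstimate {
        public static void main(String[] args) {
            long families = 24L;
            long entriesPerFamily = 300000000L;   // 300M entries per family
            long bytesPerEntry = 40L;             // ~40 bytes per entry
            long rows = 100000L;                  // 100k rows

            long bytesPerRow = families * entriesPerFamily * bytesPerEntry;
            System.out.printf("Per row: %,d bytes (~%.0f GB)%n",
                    bytesPerRow, bytesPerRow / Math.pow(1024, 3));        // ~268 GB

            double totalBytes = (double) bytesPerRow * rows;
            System.out.printf("Total:   ~%.1f PB%n",
                    totalBytes / Math.pow(1024, 5));                      // ~25.6 PB
        }
    }

So yes, roughly 270GB per row and somewhere in the 25-27 petabyte range overall, depending on whether you count in powers of two or ten.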


With a big enough cluster, you might be able to get that amount of data into
Hadoop.  I'm not sure anyone has had an HBase installation that big.


One thing that is definitely not going to work with HBase is having single
rows that are many GBs.

A row can never be split across regions, and the default region size is
256MB (though configurable), so you'd be 3 orders of magnitude over the
recommended maximum.  So to directly answer your questions, one limitation
is the size of a single row.  The other limitation is the number of regions
that can be handled on each node.  The practical upper limit is in the range
of 400-500 regions per region server, though this can vary depending on your
hardware and usage patterns.  That's about 100GB on an HBase node, so to get
this much data into HBase you'd need on the order of several hundred
thousand servers.
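To make that concrete (assuming the default 256MB region size and the 400-500 regions-per-server figure above, which are rules of thumb rather than hard limits):

    // Rough capacity math for a single region server, using the numbers above.
    public class RegionMath {
        public static void main(String[] args) {
            long regionSizeBytes = 256L << 20;   // default region size: 256MB
            int regionsPerServer = 400;          // lower end of the 400-500 range

            long bytesPerServer = regionSizeBytes * regionsPerServer;    // ~100GB per node
            double totalBytes = 27.0 * Math.pow(1024, 5);                // the ~27PB estimate above

            System.out.printf("Per server: ~%d GB%n", bytesPerServer >> 30);
            System.out.printf("Servers needed: ~%.0f%n", totalBytes / bytesPerServer);
        }
    }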

One thing you'd definitely need to do is rework your schema a bit, spreading
things across more rows so you can have reasonably sized regions.
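Purely as an illustration (the key format, names, and bucket count below are made up, not anything HBase gives you), one way to do that is to hash each entry into one of N buckets and append the bucket to the row key, so a single huge logical row becomes many region-friendly physical rows:

    // Hypothetical row-key scheme: split one huge logical row into many
    // smaller physical rows by appending a bucket suffix.
    public class RowKeys {
        private static final int BUCKETS = 4096;   // sized so each bucket stays well under a region

        // e.g. logical row "user123" + entry key "1222262460/..." -> "user123-0173"
        static String physicalRowKey(String logicalRow, String entryKey) {
            int bucket = Math.floorMod(entryKey.hashCode(), BUCKETS);
            return String.format("%s-%04d", logicalRow, bucket);
        }

        public static void main(String[] args) {
            System.out.println(physicalRowKey("user123", "1222262460/some-32-byte-string"));
        }
    }

Reads for one logical row then have to fan out across its buckets, but that's the price of keeping regions a manageable size.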

My short answer would be that this is not currently possible in HBase unless
you had a very, very large cluster and a bit of time to work out the bugs
that I'm sure will pop up with an installation of this size.  My question to
you is: do you really need random access at this granularity to 27 petabytes
of data?


Jonathan Gray 

 


-----Original Message-----
From: Alex Newman [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 6:21 AM
To: hbase-user@hadoop.apache.org
Subject: Scalability of HBase

Where are the scalability limitations with HBase? The number of tablets? The
size of the columns? I am thinking about
100k rows
24 columns
but with on the order of 300M entries per column, with something like

(timestamp,<32 byte string>)

Would something like this scale?
