Re: HBase performance

Jonathan Hendler Fri, 12 Oct 2007 10:55:04 -0700

One of the valid points Stonebraker makes, I think, has to do with
compression (and null values).  For example - does HBase also offer
tools, or a strategy for compression? Maybe it's comparing apples to
[whatever].


Since Vertica is also a distributed database, I think the comparison may
be interesting to newbies like myself on the list. To keep the
conversation topical - while it's true there's a major campaign of PR
around Vertica, I'd be interested in hearing more about how HBase
compares with other "column stores" or hybrids. There's a lot of
discussion in Semantic Web communities about these systems, since row
databases don't "scale well" for arbitrary reading of apparently
randomized, unstructured directed graphs. I'm NOT speaking from VAST
experience in this, but enough to know that there might be some fire in
the hot air. To experienced DBAs it can seem like a collection of "cheap
tricks" - but a collection of cheap tricks is as revolutionary as things
might get until we all have quantum computers running on Mr. Fusions.

Really, Hadoop, HDFS, HBase, etc. have such a range of potential uses
that I'm looking for the broad view of "to Hadoop or not Hadoop".





Jim Kellerman wrote:
> FYI: I just heard Stonebraker talk at the High Performance Transaction 
> Systems Workshop this week. His presentation focused on column oriented 
> databases and not just in memory databases.
>
> His talk was quite controversial with the traditional database folks, but he 
> did make some valid points.
>
> I had no intention of making your head explode, but rather to get people to 
> at least rethink the conventional wisdom surrounding row oriented databases. 
> After all, Stonebraker wrote databases that most modern ones are built from. 
> He should know something about the topic.
>
> ---
> Jim Kellerman, Senior Engineer; Powerset
> [EMAIL PROTECTED]
>
>
>   
>> -----Original Message-----
>> From: Jeff Hammerbacher [mailto:[EMAIL PROTECTED]
>> Sent: Friday, October 12, 2007 9:20 AM
>> To: hadoop-user@lucene.apache.org
>> Subject: Re: HBase performance
>>
>> hmm, i'm going to have to disagree strongly with jim here on
>> several points:
>>
>> 1) the paper you reference has nothing to do with
>> column-store performance:
>> it's all about a new, in-memory oltp system being worked on
>> in stonebraker's lab/vertica.  it's mainly about removing
>> disk access via replication (rather than maintaining a redo
>> log) and being smart about partitioning your data to maximize
>> "one-site" transactions.
>> 2) column store technology has been around for a while;
>> sybase iq would rule the world if column-oriented data stores
>> were a one-size-fits-all solution to every database problem.
>> 3) you totally ignore the impact of having an in-memory
>> "write-optimized store" to amortize the cost of writes to the
>> on-disk "read-optimized store"
>> (memtable and sstable in bigtable parlance--dunno what
>> they're called in hbase).  otherwise, write and bulk load
>> performance for a column-oriented data store is generally atrocious.
>> 4) your section on "adding capacity" has NOTHING at all to do
>> with organizing your data on disk in a column-oriented
>> fashion; it's a property of any reasonably well-designed
>> horizontally partitioned data store.
>>
>> there's a ton of hot air around this space in general, so
>> refraining from making claims like "column oriented databases
>> ... can outperform traditional RDBMS systems ... by an order
>> of magnitude or more for almost every kind of work load" will
>> prevent my head from exploding.
>> thanks,
>> jeff
>>
>> On 10/11/07, Jim Kellerman <[EMAIL PROTECTED]> wrote:
>>     
>>> 12345678901234567890123456789012345678901234567890123456789012345
>>>
>>> Performance always depends on the work load. However, having said
>>> that, you should read Michael Stonebraker's paper "The End of an
>>> Architectural Era (It's Time for a Complete Rewrite)" which was
>>> presented at the Very Large Database Conference. You can find a PDF
>>> copy of the paper here:
>>>
>>> http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf
>>>
>>> In this paper he presents compelling evidence that column oriented
>>> databases (HBase is a column oriented database) can outperform
>>> traditional RDBMS systems (MySql) by an order of magnitude or more for
>>> almost every kind of work load. Here's a brief summary of why this is
>>> so:
>>>
>>> - writes: a row oriented database writes the whole row whether or
>>>   not values are supplied for every field.
>>>   Space is reserved for null fields, so the number of bytes
>>>   written is the same for every row. In a column oriented
>>>   database, only the columns for which values are supplied are
>>>   written. Nulls are free. Also row oriented databases must write
>>>   a row descriptor so that when the row is read, the column values
>>>   can be found.
>>>
>>> - reads: Unless every column is being returned on a read, a column
>>>   oriented database is faster because it only reads the columns
>>>   requested. The row oriented database must read the entire row,
>>>   figure out where the requested columns are and only return that
>>>   portion of the data read.
>>>
>>> - compression: works better on a column oriented database because
>>>   the data is similar, and stored together, which is not the case
>>>   in a row oriented database.
>>>
>>> - scans: suppose you have a 600GB database with 200 columns of
>>>   equal length (the TPC-H OLTP benchmark has 212 columns) but
>>>   while you are scanning the table you only want to return 5
>>>   of the columns. Each column takes up 3GB of the 600GB. A row
>>>   oriented database will have to read the entire 600GB to extract
>>>   the 15GB of data desired. Think about how long it takes to read
>>>   600GB vs 15GB. Furthermore, in a column oriented database, each
>>>   column can be read in parallel, and the inner loop only executes
>>>   once per column rather than once per row as in the row oriented
>>>   database.
>>>
>>> - bulk loads: row oriented databases have to construct their
>>>   indexes as the load progresses, so even if the load goes from
>>>   low value to high, btrees must be split and reorganized. For
>>>   column oriented databases, this is not true.
>>>
>>> - adding capacity: in a row oriented database, you generally have
>>>   to dump the database, create a new partitioning scheme and then
>>>   load the dumped data into a new database. With HBase, storage
>>>   is only limited by the DFS. Need more storage? Add another data
>>>   node.
>>>
>>> We have done almost no tuning for HBase, but I'd be willing to bet
>>> that it would handily beat MySql in a drag race.
>>>
>>> ---
>>> Jim Kellerman, Senior Engineer; Powerset [EMAIL PROTECTED]
>>>
>>>
>>>       
>>>> -----Original Message-----
>>>> From: Rafael Turk [mailto:[EMAIL PROTECTED]
>>>> Sent: Thursday, October 11, 2007 3:36 PM
>>>> To: hadoop-user@lucene.apache.org
>>>> Subject: HBase performance
>>>>
>>>> Hi All,
>>>>
>>>>  Does anyone have comments about how HBase will perform in a
>>>> 4 node cluster compared to an equivalent MySQL configuration?
>>>>
>>>> Thanks,
>>>>
>>>> Rafael
>>>>
>>>>         
>
>
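
For anyone who wants to check the scan example in Jim's original message, here is a rough back-of-the-envelope sketch. It uses only the figures from the thread (a 600GB table, 200 columns, 5 columns requested) and assumes, as the example does, that all columns are the same width. The function names are made up for illustration, and this models bytes read only; it says nothing about seeks, compression, caching, or how HBase actually lays data out on disk.

```python
# Back-of-the-envelope model of the scan example from the thread.
# Assumption: every column has the same width, as in the original example.
# The function names are hypothetical and not part of any HBase API.

def row_store_scan_gb(table_gb: float) -> float:
    """A row store reads each row in full, so a scan touches the whole table."""
    return table_gb


def column_store_scan_gb(table_gb: float, total_cols: int, wanted_cols: int) -> float:
    """A column store reads only the requested columns."""
    per_column_gb = table_gb / total_cols
    return per_column_gb * wanted_cols


table_gb, total_cols, wanted_cols = 600, 200, 5
row_gb = row_store_scan_gb(table_gb)
col_gb = column_store_scan_gb(table_gb, total_cols, wanted_cols)

print(f"row store reads    {row_gb:.0f} GB")
print(f"column store reads {col_gb:.0f} GB")
print(f"I/O ratio          {row_gb / col_gb:.0f}x")
```

With these inputs each column is 3GB, so five columns come to 15GB and the row store reads 40 times as much data. That ratio is an upper bound on the benefit for this one query shape; a query that returns most columns would see little or no difference.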
