
Performance always depends on the workload. That said, you should
read Michael Stonebraker's paper "The End of an Architectural Era
(It's Time for a Complete Rewrite)", which was presented at the
Very Large Data Bases (VLDB) conference in 2007. You can find a
PDF copy of the paper here:
http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf

In this paper he presents compelling evidence that column oriented
databases (HBase is a column oriented database) can outperform
traditional row oriented RDBMSs (such as MySQL) by an order of
magnitude or more for almost every kind of workload. Here's a
brief summary of why this is so:

- writes: a row oriented database writes the whole row regardless
  of whether values are supplied for every field. Space is
  reserved for null fields, so the number of bytes written is the
  same for every row. In a column oriented database, only the
  columns for which values are supplied are written. Nulls are
  free. Row oriented databases must also write a row descriptor
  so that the column values can be found when the row is read.

- reads: Unless every column is being returned on a read, a column
  oriented database is faster because it only reads the columns
  requested. The row oriented database must read the entire row,
  figure out where the requested columns are and only return that
  portion of the data read.

- compression: works better on a column oriented database because
  the values in a column are similar and are stored together,
  which is not the case in a row oriented database.

- scans: suppose you have a 600GB database with 200 columns of
  equal length (the TPC-H benchmark has 212 columns) but while
  you are scanning the table you only want to return 5 of the
  columns. Each column takes up 3GB of the 600GB. A row oriented
  database will have to read the entire 600GB to extract the
  15GB of data desired. Think about how long it takes to read
  600GB vs 15GB. Furthermore, in a column oriented database, each
  column can be read in parallel, and the inner loop only executes
  once per column rather than once per row as in the row oriented
  database.

- bulk loads: row oriented databases have to construct their
  indexes as the load progresses, so even if the load goes from
  low value to high, btrees must be split and reorganized. For
  column oriented databases, this is not true.

- adding capacity: in a row oriented database, you generally have
  to dump the database, create a new partitioning scheme and then
  load the dumped data into a new database. With HBase, storage
  is only limited by the DFS. Need more storage? Add another data
  node.
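
To make the write and scan points above concrete, here is a small
Python sketch. It is only a toy model, not HBase or MySQL code:
the table, the column names and the helper functions are all made
up for illustration, and it assumes equal-width columns as in the
scan example (200 columns of 3GB each, 5 of them wanted).

```python
from collections import defaultdict

GB = 10**9

def bytes_read_row_store(total_bytes):
    """A row store scan reads every row in full."""
    return total_bytes

def bytes_read_column_store(total_bytes, num_columns, wanted):
    """A column store scan reads only the wanted columns
    (assumes all columns are the same width)."""
    per_column = total_bytes // num_columns
    return per_column * wanted

total = 600 * GB
row_bytes = bytes_read_row_store(total)
col_bytes = bytes_read_column_store(total, num_columns=200, wanted=5)
print(row_bytes // GB, "GB vs", col_bytes // GB, "GB")

# Toy write model: a hypothetical sparse table with four columns.
COLUMNS = ["id", "name", "email", "phone"]
rows = [
    {"id": 1, "name": "a", "email": None, "phone": None},
    {"id": 2, "name": None, "email": "b@example.com", "phone": None},
]

# Row layout: one full-width tuple per row; nulls still take a slot.
row_store = [tuple(r[c] for c in COLUMNS) for r in rows]

# Column layout: one map per column keyed by row id; nulls are
# simply never written.
column_store = defaultdict(dict)
for r in rows:
    for c in COLUMNS:
        if r[c] is not None:
            column_store[c][r["id"]] = r[c]

row_cells = sum(len(t) for t in row_store)
col_cells = sum(len(v) for v in column_store.values())
print(row_cells, "cells written vs", col_cells)
```

In this model the row layout writes a slot for every field of
every row, while the column layout never creates the all-null
"phone" column at all. Real systems add per-value overhead (keys,
timestamps, descriptors), so treat the numbers as a rough picture
rather than a measurement.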

We have done almost no tuning for HBase, but I'd be willing to bet
that it would handily beat MySQL in a drag race.

---
Jim Kellerman, Senior Engineer; Powerset
[EMAIL PROTECTED]


> -----Original Message-----
> From: Rafael Turk [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 11, 2007 3:36 PM
> To: hadoop-user@lucene.apache.org
> Subject: HBase performance
>
> Hi All,
>
>  Does any one have comments about how Hbase will perform in a
> 4 node cluster compared to an equivalent MySQL configuration?
>
> Thanks,
>
> Rafael
>
