We are considering using S3 as the DFS implementation for HBase. I ran
some benchmarks to get an idea of the performance differences. We are
particularly interested in being able to serve data to users from
HBase, so we want low-latency responses when getting tens of rows.

Each row ("transaction") has about 1K worth of data in about 5 columns
in two families. I'm using HBASE-605 to maintain a secondary index on
the transaction amount. There is also a "relation" to a customer
table, so some reads will also do a get from this other table.

First I ran HBase backed by HDFS. Everything was run on EC2 small
nodes: 1 node for the NameNode, 1 node for the DataNode, 1 node with
the Master and Region server, and 1 node to load/read data from.

Adding 50K transactions: [56610.166]ms
Find all transactions: [35388.601]ms
FindAll page 1: [125.058]ms (PageSize is 10)
FindAll page 11: [71.89]ms
FindAll page 51: [145.54]ms
FindAll page 61: [268.486]ms

FindAll sorted page 1: [139.881]ms
FindAll sorted page 11: [1521.655]ms
FindAll sorted page 21: [2729.641]ms
FindAll sorted page 31: [3035.18]ms

Then I ran HBase backed by S3. Everything else was the same.

Adding 50K transactions: [104826.437]ms
Find all transactions: [51622.039]ms
FindAll page 1: [5694.974]ms
FindAll page 11: [4878.234]ms
FindAll page 51: [5743.882]ms
FindAll page 61: [4167.133]ms

FindAll sorted page 1: [18535.306]ms
Then the other sorted finds timed out on the RPC call.

So to summarize:
Loading data is almost twice as slow (about 1.85x).
A long scan is about 1.5 times slower.
Short scans are over an order of magnitude slower (roughly 15x to 70x).
Random reads (done on the sorted "scan") are over 2 orders of
magnitude slower (about 130x on sorted page 1).
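
For anyone who wants to check the summary against the raw numbers, here is a minimal sketch that computes the S3/HDFS slowdown for each operation measured on both backends. The timings are copied from the results above; the dictionary keys and the `slowdown` helper are names made up for this illustration and are not part of the benchmark code.

```python
# Timings in milliseconds, copied from the two result listings above.
# Only operations measured on both backends are included.
hdfs = {
    "load 50K": 56610.166,
    "find all": 35388.601,
    "page 1": 125.058,
    "page 11": 71.89,
    "page 51": 145.54,
    "page 61": 268.486,
    "sorted page 1": 139.881,
}
s3 = {
    "load 50K": 104826.437,
    "find all": 51622.039,
    "page 1": 5694.974,
    "page 11": 4878.234,
    "page 51": 5743.882,
    "page 61": 4167.133,
    "sorted page 1": 18535.306,
}


def slowdown(hdfs_ms, s3_ms):
    """Return how many times slower the S3 run was than the HDFS run."""
    return s3_ms / hdfs_ms


# Hypothetical helper output: one ratio per operation name.
ratios = {name: slowdown(hdfs[name], s3[name]) for name in hdfs}

for name, ratio in ratios.items():
    print(f"{name:>14}: {ratio:6.1f}x slower on S3")
```

These are single-run timings, so the ratios are only a rough guide. The sorted pages beyond page 1 are left out because they timed out on S3.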

Do these results sound reasonable? Is S3 really that costly compared
to HDFS? Thanks for your input.
-clint
