Thanks for trying this "interesting" experiment, Clint. I'm a little
surprised the thing worked at all (smile).
What do you need for HBASE-50? Is it sufficient to force the cluster to
go read-only, flushing everything in memory, while the copy runs?
CopyFiles/distcp should be able to go between filesystems, running the
copy in an MR job, which is probably what you want. You could go hdfs://
to s3://. The upside would be that all of the S3 vagaries (size limits,
etc.) would be managed for you. The downside would be that you'd have to
put up an S3-backed DFS as the copy sink. Alternatively, maybe you can
convince distcp to go between hdfs:// and http://? HBase files should
never really be bigger than 1G or so, so it should be 'safe'.
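If you wanted to do the copy yourself rather than through distcp, something
along these lines is the shape of it (an untested sketch; the namenode
address, bucket name, and paths below are placeholders, and distcp does the
same copy but parallelized as an MR job):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HBaseTableBackup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Source: the HDFS instance the cluster normally runs on.
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

    // Sink: the S3 block FileSystem. Credentials are picked up from
    // fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey in the configuration.
    FileSystem s3 = FileSystem.get(URI.create("s3://my-backup-bucket/"), conf);

    Path src = new Path("/hbase/transactions");          // table dir on HDFS
    Path dst = new Path("/backups/hbase/transactions");  // destination in the bucket

    // Single-process copy; distcp does the same thing as a parallel MR job.
    FileUtil.copy(hdfs, src, s3, dst, false /* don't delete the source */, conf);
  }
}

The command-line equivalent would be roughly
"hadoop distcp hdfs://namenode:9000/hbase/transactions s3://my-backup-bucket/backups/hbase/transactions"
(again untested by me; supply the AWS keys via the fs.s3.* properties).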
St.Ack
Clint Morgan wrote:
Thanks for the input as it confirmed my suspicions.
We were debating running off of S3 just to minimize moving parts. But
it does not look feasible.
We are wanting the cluster to "live forever" in that once the app is
live, hbase will always be needed to serve data.
A primary concern is data loss, so we will probably still want to use S3
as a backup medium. Moreover, we'd like to be able to quickly recover
from HDFS failures to minimize downtime. This makes HBASE-50 look like
the way to go.
cheers,
-clint
On Wed, Apr 30, 2008 at 5:30 PM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
Anything relating to S3 will be slower, so it probably shouldn't be used as
the default FileSystem for Hadoop.
It works great if you need to park data between cluster runs, assuming you
do not need applications external to Hadoop and the cluster to be able to
read the data, since data in S3FS is stuffed into S3 as blocks (similar to
HDFS).
Further, once support for appends is added to Hadoop/HDFS, I am unsure if
it will be inherited by S3FS. I think this is a critical issue for HBase.
Assuming you aren't expecting this cluster to live forever, maybe you
should keep your authoritative data on S3 (native or S3FS) and just reload
HBase on cluster init?
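To make the distinction concrete, something like the following is what I
mean (a sketch only; the bucket name and namenode address are placeholders,
and in practice these settings would live in hadoop-site.xml / hbase-site.xml
rather than code):

import org.apache.hadoop.conf.Configuration;

public class FileSystemChoices {
  // What the benchmark below measured: S3 as the default FileSystem, so
  // HBase's rootdir (and everything else) lives on S3. Probably not what
  // you want for serving data.
  static Configuration s3AsDefault() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "s3://my-bucket");
    conf.set("hbase.rootdir", "s3://my-bucket/hbase");
    return conf;
  }

  // HDFS as the default FileSystem; S3 only gets touched when parking data
  // between cluster runs or reloading it on cluster init.
  static Configuration hdfsAsDefault() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000");
    conf.set("hbase.rootdir", "hdfs://namenode:9000/hbase");
    return conf;
  }
}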
ckw
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
On Apr 30, 2008, at 1:02 PM, Clint Morgan wrote:
We are considering using S3 as the DFS impl for HBase. I ran some
benchmarks to get an idea of the performance differences. We are
particularly interested in being able to serve data to users from
HBase, so we want low-latency responses for getting tens of rows.
Each row ("transaction") has about 1K worth of data in about 5 columns
in two families. I'm using HBASE-605 to maintain a secondary index on
the transaction amount. There is also a "relation" to a customer
table, so some reads will also do a get from this other table.
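For reference, each write looks roughly like the following (table, family,
and column names are invented for illustration, this is written against the
newer Put/HTable client API rather than the API we actually ran, and the
HBASE-605 indexed-table setup is left out):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TransactionLoader {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "transactions");

    // One "transaction" row: roughly 1K of data across ~5 columns in two families.
    Put put = new Put(Bytes.toBytes("txn-00001"));

    // "info" family carries the bulk of the payload.
    put.add(Bytes.toBytes("info"), Bytes.toBytes("amount"), Bytes.toBytes("42.50"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("description"), Bytes.toBytes("some 1K payload"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("timestamp"), Bytes.toBytes("2008-04-30"));

    // "rel" family holds the relation to the customer table; some reads
    // follow this reference with a second get against that table.
    put.add(Bytes.toBytes("rel"), Bytes.toBytes("customer"), Bytes.toBytes("cust-123"));

    table.put(put);
    table.close();
  }
}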
First I ran HBase backed by HDFS. Everything was run on EC2 small nodes:
1 node for the NameNode, 1 node for the DataNode, 1 node for the Master
and RegionServer, and 1 node to load/read data from.
Adding 50K transactions: [56610.166]ms
Find all transactions: [35388.601]ms
FindAll page 1: [125.058]ms (PageSize is 10)
FindAll page 11: [71.89]ms
FindAll page 51: [145.54]ms
FindAll page 61: [268.486]ms
FindAll sorted page 1: [139.881]ms
FindAll sorted page 11: [1521.655]ms
FindAll sorted page 21: [2729.641]ms
FindAll sorted page 31: [3035.18]ms
Then I ran HBase backed by S3, with everything else the same:
Adding 50K transactions: [104826.437]ms
Find all transactions: [51622.039]ms
FindAll page 1: [5694.974]ms
FindAll page 11: [4878.234]ms
FindAll page 51: [5743.882]ms
FindAll page 61: [4167.133]ms
FindAll sorted page 1: [18535.306]ms
Then the other sorted finds timed out on the RPC call.
So to summarize:
loading data: almost twice as slow (~105s vs ~57s)
a long scan is about 1.5 times slower (~52s vs ~35s)
short scans are over an order of magnitude slower (several seconds
instead of roughly 70-270ms)
and random reads (done on the sorted "scan") are over 2 orders of
magnitude slower (~18.5s vs ~140ms)
Do these results sound reasonable? Is S3 really that costly compared
to HDFS? Thanks for your input.
-clint