How large are the text values for the numeric keys?

I'm running a >40-node Hadoop cluster that launches ~40 MR jobs to do nothing more than bin event streams by symbol, apply some math, and stuff the results into S3 (as zip files, ugh) for pickup. These zips are in the few-megabytes range, and I have about 20k symbols currently (the next app will have 200k symbols). Cascading makes and manages all the MR jobs for me.
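As a rough sketch of the shape of that job (this is not the actual Cascading flow, and the symbol/event names are made up), the per-symbol binning and zipping step amounts to:

```python
import io
import zipfile
from collections import defaultdict

def bin_by_symbol(events):
    """Group (symbol, payload) event pairs into per-symbol bins,
    analogous to the MR shuffle keyed on symbol."""
    bins = defaultdict(list)
    for symbol, payload in events:
        bins[symbol].append(payload)
    return bins

def zip_bin(symbol, payloads):
    """Pack one symbol's payloads into an in-memory zip,
    ready for an S3 put. Returns the zip bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{symbol}.txt", "\n".join(payloads))
    return buf.getvalue()

# Hypothetical events: (symbol, tick) pairs.
events = [("AAPL", "09:30 100.0"),
          ("GOOG", "09:30 250.0"),
          ("AAPL", "09:31 100.5")]
bins = bin_by_symbol(events)
archives = {sym: zip_bin(sym, rows) for sym, rows in bins.items()}
```

In the real setup Cascading generates the equivalent MR jobs and the "apply some math" step runs per bin before the zip is written.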

Once we have validated all the result data sets, we will probably start mirroring a subset of the (daily) data in HBase for further ad hoc query support.

Point being, we are tackling it one element at a time: get Hadoop/EC2 up and stable, run larger and larger jobs/clusters, validate data, improve data accessibility, etc. We would have gone mad trying to get it all up and going in one shot.

Also, keep in mind EC2 will have permanent local storage soon. So backing up incrementally to S3 may not be necessary, depending on the SLA for that storage. A long-lived Hadoop cluster can then be as permanent as any local cluster in a datacenter.

ckw

On May 8, 2008, at 8:19 AM, Jim R. Wilson wrote:

Unfortunately, I'm about to give up on hbase over ec2.

In my application, the hbase storage is very simple, write-once text
storage.  To get this to work on ec2, I've concluded I need the
following:

1. A cluster of hadoop machines running an appropriate version of
hadoop (0.16.3 at the time of this writing)

2. HBase running on the same cluster, either connected to S3, which
I've been warned is slow, or HDFS on top of PersistentFS, which may or
may not fare better.

3. A Thrift service running atop HBase for interaction from remote
(outside EC2) Python and PHP scripts.

4. Static IPs for any Hadoop nodes running data-transfer jobs, due to
firewall restrictions on the MySQL end (outside EC2), and also so that
the Python/PHP scripts know where to find Thrift.

5. A mechanism to force all HBase nodes to write any memory-resident
changes to disk for backup purposes (Java).

Now, my particular problem is very simple - just numeric key to text
storage.  Ex: { "1":"Hello", "2":"World" }.  I've (nearly) come to the
conclusion that I would be much better off either:

a. Using an S3 bucket to store 1.txt, 2.txt, etc. (probably with a
hierarchical dir structure to keep the directories small - I've got
about 4 million such number/text pairs at the moment).

b. Using SimpleDB (which I've yet to learn, but expect to be similar
to HBase/BigTable).

c. Running an HBase/Hadoop cluster somewhere else (I already have a
single-node cluster working great on our hosting provider's internal
network).

So unless the process is drastically simpler than I've estimated, I
think my next stop is going to be a SimpleDB tutorial, keeping my
hbase work handy as another alternative.

-- Jim R. Wilson (jimbojw)

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



