How large are the text values for the numeric keys?

I'm running a >40-node Hadoop cluster that launches ~40 MR jobs to do nothing more than bin event streams by symbol, apply some math, and stuff the results into S3 (as zip files, ugh) for pickup. These zips are in the few-megabytes range, and I have about 20k symbols currently (the next app will have 200k symbols). Cascading makes and manages all the MR jobs for me.
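As a rough sketch of the shape of that job (this is not the actual Cascading flow, and the symbol/event names are made up), the per-symbol binning and zipping step amounts to:

```python
import io
import zipfile
from collections import defaultdict

def bin_by_symbol(events):
    """Group (symbol, payload) event pairs into per-symbol bins,
    analogous to the MR shuffle keyed on symbol."""
    bins = defaultdict(list)
    for symbol, payload in events:
        bins[symbol].append(payload)
    return bins

def zip_bin(symbol, payloads):
    """Pack one symbol's payloads into an in-memory zip,
    ready for an S3 put. Returns the zip bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{symbol}.txt", "\n".join(payloads))
    return buf.getvalue()

# Hypothetical events: (symbol, tick) pairs.
events = [("AAPL", "09:30 100.0"),
          ("GOOG", "09:30 250.0"),
          ("AAPL", "09:31 100.5")]
bins = bin_by_symbol(events)
archives = {sym: zip_bin(sym, rows) for sym, rows in bins.items()}
```

In the real setup Cascading generates the equivalent MR jobs and the "apply some math" step runs per bin before the zip is written.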

Once we have validated all the result data sets, we will probably start mirroring a subset of the (daily) data in HBase for further ad hoc query support.

Point being, we are tackling it one element at a time: get Hadoop/EC2 up and stable, run larger and larger jobs/clusters, validate data, improve data accessibility, etc. We would have gone mad trying to get it all up and going in one shot.

Also, keep in mind EC2 will have permanent local storage soon. So backing up incrementally to S3 may not be necessary, depending on the SLA for that storage. A long-lived Hadoop cluster can then be as permanent as any local cluster in a datacenter.

ckw

On May 8, 2008, at 8:19 AM, Jim R. Wilson wrote:

Unfortunately, I'm about to give up on hbase over ec2.

In my application, the hbase storage is very simple, write-once text
storage.  To get this to work on ec2, I've concluded I need the
following:

1. A cluster of hadoop machines running an appropriate version of
hadoop (0.16.3 at the time of this writing)

2. HBase running on the same cluster, either connected to S3, which
I've been warned is slow, or HDFS on top of PersistentFS, which may or
may not fare better.

3. A Thrift service running atop HBase for interaction from remote
(outside EC2) Python and PHP scripts.

4. Static IPs for any Hadoop nodes running data-transfer jobs, due to
firewall restrictions on the MySQL end (outside EC2), and also so that
the Python/PHP scripts know where to find Thrift.

5. A mechanism to force all HBase nodes to write any memory-resident
changes to disk for backup purposes (Java).

Now, my particular problem is very simple - just numeric key to text
storage.  Ex: { "1":"Hello", "2":"World" }.  I've (nearly) come to the
conclusion that I would be much better off either:

a. Using an S3 bucket to store 1.txt, 2.txt, etc. (probably with a
hierarchical dir structure to keep the directories small - I've got
about 4 million such number/text pairs at the moment).

b. Using SimpleDB (which I've yet to learn, but expect to be similar
to HBase/BigTable).

c. Running an HBase/Hadoop cluster somewhere else (I already have a
single-node cluster working great on our hosting provider's internal
network).

So unless the process is drastically simpler than I've estimated, I
think my next stop is going to be a SimpleDB tutorial, keeping my
hbase work handy as another alternative.

-- Jim R. Wilson (jimbojw)

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



