Unfortunately, I'm about to give up on running HBase on EC2.
In my application, the HBase usage is very simple: write-once text
storage. To get this to work on EC2, I've concluded I need the
following:
1. A cluster of Hadoop machines running an appropriate version of
Hadoop (0.16.3 at the time of this writing).
2. HBase running on the same cluster, backed either by S3, which I've
been warned is slow, or by HDFS on top of PersistentFS, which may or
may not fare better.
3. A Thrift service running atop HBase for interaction from remote
(outside EC2) Python and PHP scripts (a rough Python client sketch
follows this list).
4. Static IPs for any Hadoop nodes running data-transfer jobs, due to
firewall restrictions on the MySQL end (outside EC2), and also so that
the Python/PHP scripts know where to find Thrift.
5. A mechanism to force all HBase nodes to write any memory-resident
changes to disk for backup purposes (Java).
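For item 3, here's roughly what I have in mind on the Python side. This
is only a sketch: the host, table name ('textstore') and column name
('content:') are placeholders, the hbase.Hbase / hbase.ttypes modules
are whatever the Thrift compiler generates from Hbase.thrift, and the
exact method signatures have shifted between HBase versions.

    # Minimal sketch of a remote Python client talking to the HBase
    # Thrift gateway; host, table and column names are placeholders
    # (9090 is the default Thrift server port).
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase
    from hbase.ttypes import Mutation

    transport = TTransport.TBufferedTransport(
        TSocket.TSocket('ec2-thrift-host', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()

    # Write one row: key "1", value "Hello"
    client.mutateRow('textstore', '1',
                     [Mutation(column='content:', value='Hello')])

    # Read it back; get() returns a list of TCell structs
    cells = client.get('textstore', '1', 'content:')
    print(cells[0].value if cells else None)

    transport.close()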
Now, my particular problem is very simple - just numeric-key-to-text
storage, e.g. { "1":"Hello", "2":"World" }. I've (nearly) come to the
conclusion that I would be much better off doing one of the following:
a. Using an S3 bucket to store 1.txt, 2.txt, etc. (probably with a
hierarchical dir structure to keep the directories small - I've got
about 4 million such number/text pairs at the moment; see the boto
sketch after this list).
b. Using SimpleDB (which I've yet to learn, but expect to be similar
to HBase/BigTable; a second sketch below).
c. Running an HBase/Hadoop cluster somewhere else (I already have a
single-node cluster working great on our hosting provider's internal
network).
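For option (a), a minimal boto sketch of what I'm picturing. The bucket
name and the two-level sharding scheme are placeholders, and it assumes
AWS credentials are already configured in the environment.

    import boto

    def key_name(n):
        # Shard keys into a shallow hierarchy, e.g. 1234567 ->
        # "12/34/1234567.txt", so no single "directory" gets too large.
        s = '%08d' % n
        return '%s/%s/%d.txt' % (s[0:2], s[2:4], n)

    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-text-bucket')

    # Write one key/text pair
    bucket.new_key(key_name(1)).set_contents_from_string('Hello')

    # Read it back
    print(bucket.get_key(key_name(1)).get_contents_as_string())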
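And for option (b), the equivalent via boto's SimpleDB support - again
just a sketch with placeholder domain/attribute names; one thing I'd
need to verify is SimpleDB's cap on attribute value size (around 1 KB),
which could matter for longer texts.

    import boto

    sdb = boto.connect_sdb()
    domain = sdb.create_domain('textstore')  # no-op if it already exists

    # Write: item name is the numeric key, text lives in a "text" attribute
    domain.put_attributes('1', {'text': 'Hello'})

    # Read
    item = domain.get_attributes('1')
    print(item.get('text'))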
So unless the process is drastically simpler than I've estimated, I
think my next stop is going to be a SimpleDB tutorial, keeping my
HBase work handy as another alternative.
-- Jim R. Wilson (jimbojw)