> how large are the text values to the numeric keys?
The vast majority (75%+) are less than 100 bytes; the other 25% go up
to around 130 KB. However, there is some clustering available -
there's a good chance I'll be able to batch many small items
together into one gzip bundle.
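A minimal sketch of that batching idea, using only the Python standard library - the `bundle`/`unbundle` names and the JSON serialization are illustrative assumptions, not part of any existing API:

```python
import gzip
import json

def bundle(items):
    """Pack many small key/value pairs into a single gzip blob,
    rather than storing each ~100-byte value as its own object."""
    payload = json.dumps(items).encode("utf-8")
    return gzip.compress(payload)

def unbundle(blob):
    """Inverse of bundle(): decompress and parse the blob back
    into a dict of numeric-key -> text items."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

items = {"1": "Hello", "2": "World"}
blob = bundle(items)
assert unbundle(blob) == items
```

Since most values are tiny, one bundle per cluster should cut both the object count and the per-request overhead.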
-- Jim
On Thu, May 8, 2008 at 10:56 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> how large are the text values to the numeric keys?
>
> i'm running a >40 node Hadoop cluster that launches ~40 mr jobs to do
> nothing more than bin event streams by symbol, apply some math, and stuff
> them into S3 (as zip files, ugh) for pickup. these zips are in the few megs
> size range, and I have about 20k symbols (currently, next app will have 200k
> symbols). (cascading makes and manages all the mr jobs for me).
>
> once we have validated all the result data sets, we will probably start
> mirroring a subset of the data (it's daily) in Hbase for further adhoc query
> support.
>
> point being we are tackling each piece one element at a time: get hadoop/ec2
> up and stable, run larger and larger jobs/clusters, validate data, improve
> data accessibility, etc. we would have gone mad trying to get it all
> up and going in one shot.
>
> Also, keep in mind EC2 will have permanent local storage soon. So backing
> up incrementally to S3 may not be necessary depending on the SLA for the
> storage. So a long lived Hadoop cluster can be as permanent as any local
> cluster in a datacenter.
>
> ckw
>
>
>
> On May 8, 2008, at 8:19 AM, Jim R. Wilson wrote:
>
>
> > Unfortunately, I'm about to give up on hbase over ec2.
> >
> > In my application, the hbase storage is very simple, write-once text
> > storage. To get this to work on ec2, I've concluded I need the
> > following:
> >
> > 1. A cluster of hadoop machines running an appropriate version of
> > hadoop (0.16.3 at the time of this writing)
> >
> > 2. Hbase running on the same cluster, either connected to S3, which
> > I've been warned is slow, or HDFS on top of PersistentFS, which may or
> > may not fare better.
> >
> > 3. Thrift service running atop hbase for interaction from remote
> > (outside ec2) Python and PHP scripts.
> >
> > 4. Static IPs for any hadoop nodes running data-transfer jobs due to
> > firewall restrictions on the MySQL end (outside ec2), and also so that
> > the Python/PHP scripts know where to find Thrift.
> >
> > 5. A mechanism to force all hbase nodes to write any memory-resident
> > changes to disk for backup purposes (Java).
> >
> > Now, my particular problem is very simple - just numeric key to text
> > storage. Ex: { "1":"Hello", "2":"World" }. I've (nearly) come to the
> > conclusion that I would be much better off either:
> >
> > a. Using an S3 bucket to store 1.txt, 2.txt, etc. (probably with a
> > hierarchical dir structure to keep the directories small - I've got
> > about 4 million such number/text pairs at the moment).
> >
> > b. Using SimpleDB (which I've yet to learn, but expect to be similar
> > to hbase/BigTable)
> >
> > c. Running an hbase/hadoop cluster somewhere else (I already have a
> > single-node cluster working great on our hosting provider's internal
> > network).
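A minimal sketch of the hierarchical layout from option (a), assuming zero-padded numeric ids and a two-digit fan-out at each level (both choices are assumptions, not anything prescribed by S3):

```python
def s3_key(item_id):
    """Derive a shallow hierarchical S3 key from a numeric id so that
    no single 'directory' prefix holds millions of objects. With
    8-digit padding and two digits per level, each prefix holds at
    most 100 entries, which comfortably covers ~4 million ids."""
    padded = f"{item_id:08d}"
    return f"{padded[:2]}/{padded[2:4]}/{padded[4:6]}/{padded}.txt"

print(s3_key(1))        # 00/00/00/00000001.txt
print(s3_key(3999999))  # 03/99/99/03999999.txt
```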
> >
> > So unless the process is drastically simpler than I've estimated, I
> > think my next stop is going to be a SimpleDB tutorial, keeping my
> > hbase work handy as another alternative.
> >
> > -- Jim R. Wilson (jimbojw)
> >
>
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
>