I should add that systems like Pig and JAQL aim to satisfy needs like yours
very nicely.  They may or may not be mature enough for you yet, but they
aren't terribly far away.

Also, you should consider whether it is better for you to have a system that
is considered "industry standard" (i.e., fully relational) or "somewhat
experimental or avant-garde".  Different situations could force one answer
or the other.


On 4/4/08 11:48 AM, "Paul Danese" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Currently I have a large (for me) amount of data stored in a relational
> database (3 tables, each with 2-10 million related records; this is an
> oversimplification, but for clarity it's close enough).
> 
> There is a relatively simple object-relational mapping (ORM) to my
> database: specifically, my parent object is called "Accident".
> Accidents can have one or more Report objects (has many).
> Reports can have one or more Outcome objects (has many).
> 
> Each of these objects maps to a specific table in my RDBMS, with foreign keys
> 'connecting' records between tables.
> 
> I run searches against this database (using Lucene), and this works quite
> well as long as I return only *subsets* of the *total* result set at any one
> time.
> For example, I may have 25,000 hits ("Accidents") that meet my threshold
> Lucene score, but as long as I only query the database for 50 Accident
> "objects" at any one time, the response time is great.
> 
> The 'problem' is that I'd also like to use those 25,000 Accidents to
> generate an electronic report as **quickly as possible**
> (right now it takes about 30 minutes to collect all 25,000 hits from the
> database, extract the relevant fields and construct the actual report).
> Most of this 30 minutes is spent hitting the database and
> processing/extracting the relevant data (generating the report is rather
> fast once all the data are properly formatted).
> 
> So... at my naive level, this seems like a decent job for Hadoop.
> ***QUESTION 1: Is this an accurate belief?***
> 
> i.e., I have a semi-large collection of key/value pairs (25,000 Accident IDs
> would be the keys, and 25,000 Accident objects would be the values).
> 
> These key/value pairs are "mapped" on a cluster, extracting the relevant
> data from each object.
> The mapping then emits a new set of key/value pairs: in this case, the
> emitted keys are one of three categories (accident, report, outcome), and the
> values are arrays of accident, report, and outcome data that will go into the
> report.
> 
> These emitted key/value pairs are then "reduced", and the resulting reduced
> collections are used to build the report.
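
For what it's worth, here is a minimal sketch of what that pass could look
like with Hadoop's Java MapReduce API, assuming (purely for illustration)
that each Accident, with its Reports and Outcomes, has already been flattened
into one tab-delimited line of text in HDFS; the class names and field layout
below are made up:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: one flattened accident record in, one key/value pair per category out.
public class ReportExtractMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Hypothetical layout: accidentFields \t reportFields \t outcomeFields
    String[] parts = line.toString().split("\t", 3);
    out.collect(new Text("accident"), new Text(parts[0]));
    if (parts.length > 1) out.collect(new Text("report"), new Text(parts[1]));
    if (parts.length > 2) out.collect(new Text("outcome"), new Text(parts[2]));
  }
}

// Reduce: gather everything emitted under one category into a single block
// that the report builder can consume.
class ReportSectionReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text category, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    StringBuilder section = new StringBuilder();
    while (values.hasNext()) {
      section.append(values.next().toString()).append('\n');
    }
    out.collect(category, new Text(section.toString()));
  }
}

One caveat on the design: with only three distinct reduce keys you get at most
three-way parallelism in the reduce, so in practice you might key on accident
ID instead and group the three record types per accident.
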
> 
> ***QUESTION 2:  If the answer to Q1 is "yes", how does one typically "move"
> data from an RDBMS to something like HDFS/HBase?***
> ***QUESTION 3:  Am I right in thinking that my HBase data are going to be
> denormalized relative to my RDBMS?***
> ***QUESTION 4:  How are the data within an HBase database *actually*
> distributed amongst nodes?  i.e., is the distribution done automatically upon
> creating the db (as long as the cluster already exists)?  Or do you have to
> issue some type of command that says "okay... here's the HBase db, distribute
> it to nodes a - z"?***
> ***QUESTION 5:  Or is this whole problem something better addressed by some
> type of high-performance RDBMS cluster?***
> ***QUESTION 6:  Is there a detailed (step by step) tutorial on how to use
> HBase w/ Hadoop?***
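
Regarding Question 2: the low-tech route most people start with is to dump a
denormalized join of the tables to a delimited file and copy it into HDFS.
A sketch below; the JDBC URL, credentials, query, and paths are all
placeholders:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExportAccidents {
  public static void main(String[] args) throws Exception {
    // 1. Dump a denormalized join to a local tab-delimited file.
    Connection db = DriverManager.getConnection(
        "jdbc:postgresql://dbhost/accidents", "user", "password");
    Statement stmt = db.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT a.id, a.summary, r.text, o.code "
        + "FROM accident a JOIN report r ON r.accident_id = a.id "
        + "JOIN outcome o ON o.report_id = r.id");
    BufferedWriter out = new BufferedWriter(new FileWriter("/tmp/accidents.tsv"));
    while (rs.next()) {
      out.write(rs.getString(1) + "\t" + rs.getString(2) + "\t"
          + rs.getString(3) + "\t" + rs.getString(4));
      out.newLine();
    }
    out.close();
    db.close();

    // 2. Copy the local file into HDFS, where a MapReduce job can read it.
    FileSystem fs = FileSystem.get(new Configuration());
    fs.copyFromLocalFile(new Path("/tmp/accidents.tsv"),
                         new Path("/data/accidents.tsv"));
  }
}

The shell equivalent of step 2 is just "hadoop fs -put /tmp/accidents.tsv
/data/", and that flattened dump is also roughly the shape the data would
take in HBase, i.e., denormalized relative to your source tables (Question 3).
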
> 
> 
> Anyway, apologies if this is the 1000th time this has been answered and
> thank you for any insight!
