On Wed, Jul 23, 2008 at 7:33 AM, Amber <[EMAIL PROTECTED]> wrote:
> We have a 10 million row table exported from an AS400 mainframe every day. The 
> table is exported as a csv text file, which is about 30GB in size, and the csv 
> file is then imported into an RDBMS table which is dropped and recreated every 
> day. Now we want to find how many rows are updated during each export-import 
> interval. The table has a primary key, so deletes and inserts can be found 
> quickly using RDBMS joins, but we must do a column-to-column comparison to 
> find the differences between rows (about 90%) with the same primary keys. Our 
> goal is a comparison process that takes no more than 10 minutes on a 4-node 
> cluster, where each server has 4 4-core 3.0 GHz CPUs, 8GB memory, and a 300GB 
> local RAID5 array.
>
> Below is our current solution:
>    The old data is kept in the RDBMS with an index on the primary key, and 
> the new data is imported into HDFS as the input file of our Map-Reduce job. 
> Every map task connects to the RDBMS and selects the old data for each of its 
> rows; map tasks generate output when differences are found, and there are no 
> reduce tasks.
>
> As you can see, as the number of concurrent map tasks increases, the RDBMS 
> becomes the bottleneck, so we want to get rid of the RDBMS, but we have no 
> idea how to quickly retrieve the old row for a given key from HDFS files. 
> Any suggestion is welcome.

Think of map/reduce as giving you a kind of key/value lookup for free
- it just falls out of how the system works.

You don't care about the RDBMS.  It's a distraction - you're given a
set of csv files with unique keys and dates, and you need to find the
differences between them.

Say the data looks like this:

File for jul 10:
0x1,stuff
0x2,more stuff

File for jul 11:
0x1,stuff
0x2,apples
0x3,parrot

Preprocess the csv files to add dates to the values:

File for jul 10:
0x1,20080710,stuff
0x2,20080710,more stuff

File for jul 11:
0x1,20080711,stuff
0x2,20080711,apples
0x3,20080711,parrot

Feed two days' worth of these files into a Hadoop job.

The mapper splits each line into k=0x1, v=20080710,stuff, etc.
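
Something like this for the mapper, assuming the preprocessed lines look
like key,date,rest - a rough sketch against the old
org.apache.hadoop.mapred API, and the class name is just made up:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Turns "0x1,20080710,stuff" into key=0x1, value=20080710,stuff
public class DiffMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String s = line.toString();
    int comma = s.indexOf(',');          // split on the first comma only
    if (comma < 0) return;               // skip malformed lines
    output.collect(new Text(s.substring(0, comma)),
                   new Text(s.substring(comma + 1)));
  }
}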

The reducer gets one or two v's per key, and each v has the date
embedded in it - that's essentially your lookup step.
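
The reducer then just compares whatever it gets for each key - again a
rough sketch with hypothetical names, assuming the value starts with an
8-digit yyyymmdd date; emit whatever record format you actually want:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Sees at most one value per day for each key.  One value only means an
// insert or a delete; two values with different payloads means an update.
public class DiffReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String first = values.next().toString();
    if (!values.hasNext()) {
      output.collect(key, new Text("only-one-day\t" + first));
      return;
    }
    // note: the order of the two days isn't guaranteed, so emit both
    String second = values.next().toString();
    String a = first.substring(9);   // strip "20080710," before comparing
    String b = second.substring(9);
    if (!a.equals(b)) {
      output.collect(key, new Text("changed\t" + first + "\t" + second));
    }
  }
}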

You'll end up with a system that can do compares for any two dates,
and could easily be expanded to do all sorts of deltas across these
files.
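
The driver is nothing special either - point it at the two days' files
and an output directory.  A sketch with the old JobConf-style setup,
paths and class names are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class DiffJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(DiffJob.class);
    conf.setJobName("csv-diff");

    conf.setMapperClass(DiffMapper.class);
    conf.setReducerClass(DiffReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // one input path per day's preprocessed csv,
    // e.g. /data/20080710.csv /data/20080711.csv /data/diff-out
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileInputFormat.addInputPath(conf, new Path(args[1]));
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));

    JobClient.runJob(conf);
  }
}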

The preprocess-the-files-to-add-a-date step can probably be folded into
your mapper rather than being a separate pass - it just depends on how
easily one of the off-the-shelf mappers fits your data.  If it does
turn out to be its own step, it can be a very simple Hadoop job.
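
For example, if each day's export lands in a file named after its date,
the mapper can tag values itself by reading the map.input.file property
in configure() and skip the preprocessing pass entirely.  Another
hedged sketch, with made-up names and an assumed yyyymmdd file-name
prefix:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Same as DiffMapper, but derives the date tag from the input file name
// (e.g. .../20080711.csv) instead of a separate preprocessing step.
public class TaggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String date;

  public void configure(JobConf job) {
    String file = job.get("map.input.file");     // full path of this split's file
    String name = file.substring(file.lastIndexOf('/') + 1);
    date = name.substring(0, 8);                 // assumes a yyyymmdd prefix
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String s = line.toString();
    int comma = s.indexOf(',');
    if (comma < 0) return;
    output.collect(new Text(s.substring(0, comma)),
                   new Text(date + "," + s.substring(comma + 1)));
  }
}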

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
