Yes, I think this is the simplest method, but there are problems too:

1. The reduce stage will not begin until the map stage ends, by which time we 
have already scanned both tables, and the comparing (sketched below) will take 
almost as long again, because about 90% of the intermediate keys will have two 
values to compare. If I could specify a number n so that the reduce tasks start 
as soon as n intermediate pairs with the same key have been collected, that 
would be better. In my case I would set the magic number to 2.

2. I am not sure how Hadoop stores intermediate <key, value> pairs; if they are 
kept in memory, we could not afford that as the data volume grows.
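
For reference, here is how I currently picture the reduce side of the job you 
describe. It is only a rough, untested sketch against the org.apache.hadoop.mapred 
API, and the class name is mine. It assumes the mapper emits 
<primary key, date + "," + rest-of-row>, as in the mapper sketch further down:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Compares the one or two values that arrive for each primary key.
// Each value looks like "20080710,rest-of-row" (date tag added by the mapper).
public class RowDiffReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String first = values.next().toString();
    if (!values.hasNext()) {
      // Key present in only one day's file: an insert or a delete,
      // depending on which date the single value carries.
      out.collect(key, new Text("ONLY " + first));
      return;
    }
    String second = values.next().toString();

    // Strip the leading date tags before comparing the row bodies.
    String firstRow = first.substring(first.indexOf(',') + 1);
    String secondRow = second.substring(second.indexOf(',') + 1);
    if (!firstRow.equals(secondRow)) {
      out.collect(key, new Text("UPDATED " + first + " -> " + second));
    }
    // Identical rows produce no output.
  }
}

The driver would just be the usual JobConf setup (TextInputFormat over both 
days' directories, setMapperClass/setReducerClass, then JobClient.runJob), so I 
have left it out.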

--------------------------------------------------
From: "James Moore" <[EMAIL PROTECTED]>
Sent: Thursday, July 24, 2008 1:12 AM
To: <[email protected]>
Subject: Re: Using MapReduce to do table comparing.

> On Wed, Jul 23, 2008 at 7:33 AM, Amber <[EMAIL PROTECTED]> wrote:
>> We have a 10-million-row table exported from an AS400 mainframe every day; 
>> the table is exported as a CSV text file of about 30 GB, which is then 
>> imported into an RDBMS table that is dropped and recreated every day. Now we 
>> want to find how many rows are updated during each export-import interval. 
>> The table has a primary key, so deletes and inserts can be found quickly 
>> using RDBMS joins, but we must do a column-to-column comparison to find the 
>> differences between rows (about 90%) that share a primary key. Our goal is a 
>> comparison process that takes no more than 10 minutes on a 4-node cluster, 
>> where each server has 4 4-core 3.0 GHz CPUs, 8 GB of memory and a 300 GB 
>> local RAID5 array.
>>
>> Below is our current solution:
>>    The old data is kept in the RDBMS with an index on the primary key, and 
>> the new data is imported into HDFS as the input file of our Map-Reduce job. 
>> Every map task connects to the RDBMS and selects the old row for each new 
>> row; map tasks generate output only if differences are found, and there are 
>> no reduce tasks.
>>
>> As you can see, as the number of concurrent map tasks increases, the RDBMS 
>> will become the bottleneck, so we want to get rid of the RDBMS, but we have 
>> no idea how to quickly retrieve the old row for a given key from HDFS files. 
>> Any suggestion is welcome.
> 
> Think of map/reduce as giving you a kind of key/value lookup for free
> - it just falls out of how the system works.
> 
> You don't care about the RDBMS.  It's a distraction - you're given a
> set of csv files with unique keys and dates, and you need to find the
> differences between them.
> 
> Say the data looks like this:
> 
> File for jul 10:
> 0x1,stuff
> 0x2,more stuff
> 
> File for jul 11:
> 0x1,stuff
> 0x2,apples
> 0x3,parrot
> 
> Preprocess the csv files to add dates to the values:
> 
> File for jul 10:
> 0x1,20080710,stuff
> 0x2,20080710,more stuff
> 
> File for jul 11:
> 0x1,20080711,stuff
> 0x2,20080711,apples
> 0x3,20080711,parrot
> 
> Feed two days' worth of these files into a Hadoop job.
> 
> The mapper splits these into k=0x1, v=20080710,stuff etc.
> 
> The reducer gets one or two v's per key, and each v has the date
> embedded in it - that's essentially your lookup step.
> 
> You'll end up with a system that can do compares for any two dates,
> and could easily be expanded to do all sorts of deltas across these
> files.
> 
> The preprocess-the-files-to-add-a-date can probably be included as
> part of your mapper and isn't really a separate step - just depends on
> how easy it is to use one of the off-the-shelf mappers with your data.
> If it turns out to be its own step, it can become a very simple
> Hadoop job.
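
To make the mapper half concrete for myself, here is a rough, untested sketch 
in the org.apache.hadoop.mapred API. The class name is mine, and it assumes 
each day's export sits in a directory named after the date (for example 
/exports/20080710/table.csv), so the date tag can be taken from the input file 
name instead of a separate preprocess job:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Turns each CSV line into <primary key, date + "," + rest-of-row>.
public class RowTagMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String date;

  public void configure(JobConf job) {
    // map.input.file holds the path of the file the current split came from.
    String path = job.get("map.input.file");
    String[] parts = path.split("/");
    // Assumes the parent directory name is the yyyymmdd export date.
    date = parts[parts.length - 2];
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String row = line.toString();
    int comma = row.indexOf(',');
    if (comma < 0) {
      return;                                 // skip malformed lines
    }
    String key = row.substring(0, comma);     // e.g. 0x1
    String rest = row.substring(comma + 1);   // e.g. stuff
    out.collect(new Text(key), new Text(date + "," + rest));
  }
}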
> 
> -- 
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
> 
