Yes, I think this is the simplest method, but there are problems too:

1. The reduce stage wouldn't begin until the map stage ends, by which time we will already have scanned both tables, and the comparing will take almost as long again, because about 90% of the intermediate keys will have two values that still need to be compared. It would be better if I could specify a number n so that reduce tasks start as soon as n intermediate pairs share the same key; in my case I would set that magic number to 2.
2. I am not sure how Hadoop stores intermediate <key, value> pairs; if they are kept in memory, we will not be able to afford that as the data volume grows.

--------------------------------------------------
From: "James Moore" <[EMAIL PROTECTED]>
Sent: Thursday, July 24, 2008 1:12 AM
To: <[email protected]>
Subject: Re: Using MapReduce to do table comparing.

> On Wed, Jul 23, 2008 at 7:33 AM, Amber <[EMAIL PROTECTED]> wrote:
>> We have a 10 million row table exported from an AS400 mainframe every day.
>> The table is exported as a csv text file, which is about 30GB in size, and
>> the csv file is then imported into an RDBMS table which is dropped and
>> recreated every day. Now we want to find how many rows are updated during
>> each export-import interval. The table has a primary key, so deletes and
>> inserts can be found quickly using RDBMS joins, but we must do a
>> column-by-column comparison to find the differences between rows (about
>> 90%) with the same primary keys. Our goal is a comparing process that
>> takes no more than 10 minutes on a 4-node cluster, where each server has
>> 4 4-core 3.0 GHz CPUs, 8GB memory and a 300G local RAID5 array.
>>
>> Below is our current solution:
>> The old data is kept in the RDBMS with an index created on the primary
>> key, and the new data is imported into HDFS as the input file of our
>> Map-Reduce job. Every map task connects to the RDBMS database and selects
>> the old data from it for every row; map tasks generate outputs if
>> differences are found, and there are no reduce tasks.
>>
>> As you can see, as the number of concurrent map tasks increases, the
>> RDBMS database will become the bottleneck, so we want to get rid of the
>> RDBMS, but we have no idea how to quickly retrieve the old row for a
>> given key from HDFS files. Any suggestion is welcome.
>
> Think of map/reduce as giving you a kind of key/value lookup for free -
> it just falls out of how the system works.
>
> You don't care about the RDBMS. It's a distraction - you're given a set
> of csv files with unique keys and dates, and you need to find the
> differences between them.
>
> Say the data looks like this:
>
> File for jul 10:
> 0x1,stuff
> 0x2,more stuff
>
> File for jul 11:
> 0x1,stuff
> 0x2,apples
> 0x3,parrot
>
> Preprocess the csv files to add dates to the values:
>
> File for jul 10:
> 0x1,20080710,stuff
> 0x2,20080710,more stuff
>
> File for jul 11:
> 0x1,20080711,stuff
> 0x2,20080711,apples
> 0x3,20080711,parrot
>
> Feed two days' worth of these files into a hadoop job.
>
> The mapper splits these into k=0x1, v=20080710,stuff etc.
>
> The reducer gets one or two v's per key, and each v has the date
> embedded in it - that's essentially your lookup step.
>
> You'll end up with a system that can do compares for any two dates, and
> it could easily be expanded to do all sorts of deltas across these files.
>
> The preprocess-the-files-to-add-a-date step can probably be included as
> part of your mapper and isn't really a separate step - it just depends
> on how easy it is to use one of the off-the-shelf mappers with your
> data. If it turns out to be its own step, it can become a very simple
> hadoop job.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
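
For concreteness, here is a minimal sketch of the job James describes, written against the old org.apache.hadoop.mapred API. The class names (TableDiff, DiffMapper, DiffReducer) and the ONLY_IN/CHANGED output markers are just placeholders, and it assumes the csv lines have already been preprocessed into the "key,date,row" form shown above:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TableDiff {

  // Mapper: each preprocessed line looks like "0x1,20080710,stuff".
  // Split off the primary key; keep "date,row-body" as the value.
  public static class DiffMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String row = line.toString();
      int comma = row.indexOf(',');
      if (comma < 0) return;                     // ignore malformed lines
      out.collect(new Text(row.substring(0, comma)),
                  new Text(row.substring(comma + 1)));
    }
  }

  // Reducer: gets one or two values per primary key, each prefixed with
  // its export date. One value means the key exists on only one day (an
  // insert or a delete); two values with different row bodies mean the
  // row changed between the two exports.
  public static class DiffReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String first = values.next().toString();
      if (!values.hasNext()) {
        out.collect(key, new Text("ONLY_IN:" + first));
        return;
      }
      String second = values.next().toString();
      // strip the "yyyymmdd," prefix before comparing the row bodies
      String a = first.substring(first.indexOf(',') + 1);
      String b = second.substring(second.indexOf(',') + 1);
      if (!a.equals(b)) {
        out.collect(key, new Text("CHANGED"));
      }
    }
  }

  // Driver: args are the two days' input directories and an output path.
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TableDiff.class);
    conf.setJobName("table-diff");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(DiffMapper.class);
    conf.setReducerClass(DiffReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}

Because the date travels inside the value, the same job can diff any two days' files, and the reducer never has to look anything up in the RDBMS.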
