On Wed, Jul 23, 2008 at 7:33 AM, Amber <[EMAIL PROTECTED]> wrote:
> We have a 10 million row table exported from an AS400 mainframe every
> day. The table is exported as a csv text file, which is about 30GB in
> size, and the csv file is then imported into an RDBMS table which is
> dropped and recreated every day. Now we want to find how many rows are
> updated during each export-import interval. The table has a primary key,
> so deletes and inserts can be found quickly using RDBMS joins, but we
> must do a column-to-column comparison to find the differences between
> rows (about 90%) with the same primary keys. Our goal is a comparison
> process that takes no more than 10 minutes on a 4-node cluster, where
> each server has four 4-core 3.0 GHz CPUs, 8GB memory and a 300GB local
> RAID5 array.
>
> Below is our current solution:
> The old data is kept in the RDBMS with an index created on the primary
> key, and the new data is imported into HDFS as the input file of our
> Map-Reduce job. Every map task connects to the RDBMS database and selects
> the old data from it for every row; map tasks generate output if
> differences are found, and there are no reduce tasks.
>
> As you can see, as the number of concurrent map tasks increases, the
> RDBMS database will become the bottleneck, so we want to get rid of the
> RDBMS, but we have no idea how to retrieve an old row with a given key
> quickly from HDFS files. Any suggestion is welcome.
Think of map/reduce as giving you a kind of key/value lookup for free - it
just falls out of how the system works. You don't care about the RDBMS.
It's a distraction - you're given a set of csv files with unique keys and
dates, and you need to find the differences between them.

Say the data looks like this:

File for jul 10:
0x1,stuff
0x2,more stuff

File for jul 11:
0x1,stuff
0x2,apples
0x3,parrot

Preprocess the csv files to add dates to the values:

File for jul 10:
0x1,20080710,stuff
0x2,20080710,more stuff

File for jul 11:
0x1,20080711,stuff
0x2,20080711,apples
0x3,20080711,parrot

Feed two days' worth of these files into a hadoop job. The mapper splits
these into k=0x1, v=20080710,stuff etc. The reducer gets one or two v's per
key, and each v has the date embedded in it - that's essentially your
lookup step.

You'll end up with a system that can do compares for any two dates, and it
could easily be expanded to do all sorts of deltas across these files.

The preprocess-the-files-to-add-a-date step can probably be included as
part of your mapper and isn't really a separate step - it just depends on
how easy it is to use one of the off-the-shelf mappers with your data. If
it turns out to be its own step, it can become a very simple hadoop job.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
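For reference, a minimal sketch of the kind of job described above, assuming
the csv lines have already been preprocessed into key,date,rest-of-row form
and using Hadoop's old-style mapred API. The class names, the output format,
and the input/output paths are made up for illustration, not taken from the
thread:

// CsvDiff.java - compare two dated csv dumps by primary key.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CsvDiff {

  // Mapper: "0x1,20080710,stuff" -> key "0x1", value "20080710,stuff"
  public static class DiffMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String s = line.toString();
      int comma = s.indexOf(',');
      out.collect(new Text(s.substring(0, comma)),
                  new Text(s.substring(comma + 1)));
    }
  }

  // Reducer: one or two dated values per key.  A single value is an insert
  // or a delete (depending on its date); two values with different payloads
  // are an update.
  public static class DiffReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String first = values.next().toString();
      if (!values.hasNext()) {
        out.collect(key, new Text("ONLY " + first));
        return;
      }
      String second = values.next().toString();
      // Strip the leading "yyyymmdd," before comparing the row contents.
      String a = first.substring(first.indexOf(',') + 1);
      String b = second.substring(second.indexOf(',') + 1);
      if (!a.equals(b)) {
        out.collect(key, new Text("CHANGED " + first + " | " + second));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(CsvDiff.class);
    conf.setJobName("csvdiff");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(DiffMapper.class);
    conf.setReducerClass(DiffReducer.class);
    // e.g. hadoop jar csvdiff.jar CsvDiff /data/20080710 /data/20080711 /out
    FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}

Because each day's export has at most one row per primary key, the reducer
sees at most two values per key: a lone value is an insert or delete, and two
values whose payloads differ are an update - which is the "lookup for free"
described above.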
