This is merely an "in the ballpark" calculation, regarding that 10 minute / 4-node requirement...
We have a reasonably similar Hadoop job (slightly more complex in the reduce phase) running on AWS with:

  * 100+2 nodes (m1.xl config)
  * approx 3x the number of rows and data size
  * completes in 6 minutes

You have faster processors, so it might require more nodes than your four -- something on the order of 25-35 nodes for 10 min completion. That's a very rough estimate.

Those other two steps (deletes, inserts) might be performed in the same pass as the compares -- and potentially be quicker overall, when you consider the time to load and recreate the table in the RDBMS. A rough sketch of what that single pass could look like is below, after your quoted message.

Paco

On Wed, Jul 23, 2008 at 9:33 AM, Amber <[EMAIL PROTECTED]> wrote:
> We have a 10 million row table exported from an AS400 mainframe every day.
> The table is exported as a csv text file, which is about 30GB in size, then
> the csv file is imported into an RDBMS table which is dropped and recreated
> every day. Now we want to find how many rows are updated during each
> export-import interval. The table has a primary key, so deletes and inserts
> can be found quickly using RDBMS joins, but we must do a column-to-column
> comparison in order to find the differences between rows (about 90%) with
> the same primary keys. Our goal is a comparing process which takes no more
> than 10 minutes on a 4-node cluster, where each server has four 4-core
> 3.0 GHz CPUs, 8GB memory and a 300GB local RAID5 array.
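For what it's worth, here is a rough sketch of what that single pass could look like as a Hadoop Streaming job in Python. It is purely illustrative: the script names, the "old"/"new" path convention, and the assumption that the primary key is the first CSV column (with no embedded commas) are all mine, not anything from your setup.

#!/usr/bin/env python
# compare_mapper.py -- hypothetical streaming mapper.  Tags each record with
# the export it came from, keyed on the primary key (the first CSV column).
import os
import sys

# Hadoop Streaming normally passes the current input file name to the mapper
# in the "map_input_file" environment variable; here we assume the two
# exports live under paths containing "old" and "new".
src = "N" if "new" in os.environ.get("map_input_file", "") else "O"

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    key, _, rest = line.partition(",")
    # Emit key <tab> tag <tab> rest-of-row; streaming splits key/value on
    # the first tab, so both versions of a key reach the same reducer.
    print("%s\t%s\t%s" % (key, src, rest))

#!/usr/bin/env python
# compare_reducer.py -- hypothetical streaming reducer.  Input arrives sorted
# by key, so both versions of a row show up together and deletes, inserts and
# updates all fall out of one pass.
import sys

def flush(key, old, new):
    if old is None:
        print("INSERT\t%s" % key)
    elif new is None:
        print("DELETE\t%s" % key)
    elif old != new:
        print("UPDATE\t%s" % key)   # a column-level diff could be emitted here

prev = old = new = None
for line in sys.stdin:
    key, tag, rest = line.rstrip("\n").split("\t", 2)
    if key != prev:
        if prev is not None:
            flush(prev, old, new)
        prev, old, new = key, None, None
    if tag == "O":
        old = rest
    else:
        new = rest
if prev is not None:
    flush(prev, old, new)

You would run it roughly along the lines of: hadoop jar hadoop-streaming.jar -input old_export -input new_export -output diffs -mapper compare_mapper.py -reducer compare_reducer.py -file compare_mapper.py -file compare_reducer.py (exact paths and jar name depend on your install). Counting the UPDATE lines in the output gives the number of changed rows, and the INSERT/DELETE lines cover the other two steps in the same pass.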
