This is merely an "in the ballpark" calculation, regarding that 10
minute / 4-node requirement...

We have a reasonably similar Hadoop job (slightly more complex in the
reduce phase) running on AWS with:

   * 100+2 nodes (m1.xlarge config)
   * approx 3x the number of rows and data size
   * completes in 6 minutes

You have faster processors.

So it might require more nodes than that -- on the order of 25-35 -- for
a 10 min completion. That's a very rough estimate.
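
For what it's worth, the back-of-envelope goes roughly like this (assuming
near-linear scaling, which is generous): 102 nodes over ~3x your data volume
in 6 minutes works out to about 102 / 3 = 34 nodes for your data size at the
same runtime; the looser 10-minute budget and your faster CPUs claw some of
that back, which is roughly where 25-35 lands.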

Those other two steps (finding deletes and inserts) could be performed in
the same pass as the compares -- and potentially be quicker overall, once
you factor in the time to load and recreate the table in the RDBMS.
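
Very rough sketch of what that single pass could look like, just to make
the idea concrete (hypothetical and untested, written against the old
mapred API; assumes the primary key is the first CSV field, that
yesterday's and today's dumps are read from input paths containing "old"
and "new" respectively, and that the job-driver wiring is omitted):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class DailyDiff {

  // Tag each CSV line with which dump it came from, keyed by primary key
  // (assumed to be the first comma-separated field).
  public static class TagMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    public void configure(JobConf job) {
      // "map.input.file" is the path of the split being read;
      // assumes the two dumps live under .../old/ and .../new/
      tag = job.get("map.input.file").contains("new") ? "N" : "O";
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String row = line.toString();
      String pk = row.substring(0, row.indexOf(','));
      out.collect(new Text(pk), new Text(tag + "\t" + row));
    }
  }

  // For each primary key, decide whether the row is an insert, delete,
  // or update; identical rows produce no output at all.
  public static class DiffReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text pk, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String oldRow = null, newRow = null;
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("N")) { newRow = v.substring(2); }
        else                   { oldRow = v.substring(2); }
      }
      if (oldRow == null)              out.collect(pk, new Text("INSERT"));
      else if (newRow == null)         out.collect(pk, new Text("DELETE"));
      else if (!oldRow.equals(newRow)) out.collect(pk, new Text("UPDATE"));
    }
  }
}

Since rows that compare equal are simply dropped in the reducer, what comes
out is just the keys that changed and how they changed -- much less data to
push back into the RDBMS than a full drop-and-reload.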

Paco


On Wed, Jul 23, 2008 at 9:33 AM, Amber <[EMAIL PROTECTED]> wrote:
> We have a 10 million row table exported from an AS400 mainframe every day.
> The table is exported as a csv text file, about 30GB in size, and the csv
> file is then imported into an RDBMS table which is dropped and recreated
> every day. Now we want to find how many rows are updated during each
> export-import interval. The table has a primary key, so deletes and inserts
> can be found quickly using RDBMS joins, but we must do a column-to-column
> comparison to find the differences between rows (about 90% of them) that
> share the same primary keys. Our goal is a comparing process which takes no
> more than 10 minutes on a 4-node cluster, where each server has 4 4-core
> 3.0 GHz CPUs, 8GB of memory and a 300GB local RAID5 array.
