If you write a SequenceFile with the results from the RDBMS, you can use
the join primitives to handle this rapidly.
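
Something along these lines (an untested sketch, assuming Text keys and
values and a made-up output path) is what I mean by writing the dump as a
SequenceFile:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/data/old/part-00000"); // hypothetical path

    // Key = primary key, value = the rest of the row (e.g. the raw CSV line).
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // Rows must arrive here already sorted by primary key,
      // e.g. SELECT ... ORDER BY pk on the RDBMS side.
      writer.append(new Text("1000001"), new Text("col1,col2,col3"));
      writer.append(new Text("1000002"), new Text("col4,col5,col6"));
    } finally {
      writer.close();
    }
  }
}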

The key is that you have to write the data in the native key sort order.

Since you have a primary key, you should be able to dump the table in
primary key order, and you can define a comparator (if needed) so that
Hadoop will consider that order to be the key sort order.
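
If the default byte-order comparison of your key type doesn't match the
order the RDBMS dumps in, a comparator along these lines can tell Hadoop
what the sort order is (rough sketch; it assumes the primary keys are
plain decimal strings stored as Text):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class NumericKeyComparator extends WritableComparator {
  public NumericKeyComparator() {
    super(Text.class);
  }

  public int compare(WritableComparable a, WritableComparable b) {
    // Assumption: keys are decimal strings, so compare numerically
    // ("9" before "10"), matching a numeric ORDER BY on the RDBMS side.
    long x = Long.parseLong(a.toString());
    long y = Long.parseLong(b.toString());
    return (x < y) ? -1 : (x == y ? 0 : 1);
  }
}
// One place to register it, e.g.:
//   jobConf.setOutputKeyComparatorClass(NumericKeyComparator.class);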

If you just want to do random lookup of keys, write a MapFile and then
you can do rapid random access. (The same rules above apply when writing
the MapFile.)
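
For the random-access case, the lookup side looks roughly like this (the
MapFile directory and the key are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class LookupRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The MapFile directory holds "data" and "index" files; keys must have
    // been appended in sorted order when it was written.
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/old.map", conf);
    try {
      Text value = new Text();
      if (reader.get(new Text("1000001"), value) != null) {
        System.out.println("old row: " + value);
      }
    } finally {
      reader.close();
    }
  }
}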


I believe that in versions of Hadoop later than 0.16.3 the join primitives
work for text files as well, if the text file is sorted.
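
For example, the map-side join setup would look something like this (from
memory, so treat the details as assumptions; the paths are made up, and
both inputs have to be sorted and identically partitioned by key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class DiffJobSetup {
  public static void configure(JobConf job) {
    // "outer" join of yesterday's and today's sorted dumps, so the mapper
    // sees both versions of each primary key together.
    String expr = CompositeInputFormat.compose(
        "outer", KeyValueTextInputFormat.class,
        new Path("/data/yesterday"), new Path("/data/today"));
    job.set("mapred.join.expr", expr);
    job.setInputFormat(CompositeInputFormat.class);
    // The map input value is then a TupleWritable with one slot per input;
    // compare the two slots to decide whether the row was changed,
    // inserted, or deleted.
  }
}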


Amber wrote:
> We have a 10-million-row table exported from an AS400 mainframe every day; the
> table is exported as a CSV text file, which is about 30GB in size, and the
> CSV file is then imported into an RDBMS table which is dropped and recreated every
> day. Now we want to find how many rows are updated during each export-import 
> interval, the table has a primary key, so deletes and inserts can be found 
> using RDBMS joins quickly, but we must do a column-by-column comparison in
> order to find the differences between rows (about 90%) with the same primary
> keys. Our goal is to find a comparing process which takes no more than 10 
> minutes with a 4-node cluster, each server in which has 4 4-core 3.0 GHz 
> CPUs, 8GB of memory, and a 300GB local RAID5 array.
>
> Below is our current solution:
>     The old data is kept in the RDBMS with index created on the primary key, 
> the new data is imported into HDFS as the input file of our Map-Reduce job. 
> Every map task connects to the RDBMS and selects the old data from it
> for every row; map tasks generate output if differences are found, and
> there are no reduce tasks.
>
> As you can see, as the number of concurrent map tasks increases, the RDBMS
> database will become the bottleneck, so we want to take the RDBMS out of the
> loop, but we have no idea how to retrieve the old row for a given key quickly
> from HDFS files; any suggestions are welcome.
