On Thu, Jul 24, 2008 at 8:03 AM, Amber <[EMAIL PROTECTED]> wrote:
> Yes, I think this is the simplest method, but there are problems too:
>
> 1. The reduce stage wouldn't begin until the map stage ends, by which time
> we have already done two full table scans, and the comparing will take
> almost the same time again, because about 90% of intermediate <key, value>
> pairs will have two values and different keys. If I could specify a number
> n, so that the reduce tasks start once there are n intermediate pairs with
> the same key, that would be better. In my case I would set the magic number
> to 2.
I don't think I understood this completely, but I'll try to respond.

First, I think you're going to be doing something like two full table scans
in any case. Whether it's in an RDBMS or in Hadoop, you need to read the
complete dataset for both day1 and day2. (Or at least that's how I
interpreted your original mail - you're not trying to keep deltas over N
days, just doing a delta for yesterday/today from scratch every time.) You
could possibly speed this up by keeping some kind of parsed data in Hadoop
for previous days, rather than just text, but I wouldn't do this as my first
solution.

It seems like starting the reducers before the maps are done isn't going to
buy you anything. The same amount of total work needs to be done; when the
work starts doesn't matter much. In this case, I'm guessing that you're
going to have a setup where (total number of maps) == (total number of
reducers) == 4 * (number of 4-core machines).

In any case, I'd say you should do some experiments with the simplest
solution you can come up with. Your problem seems simple enough that just
banging out some throwaway experimental code is going to a) not take very
long, and b) tell you quite a bit about how your particular solution is
going to behave in the real world.

>
> 2. I am not sure about how Hadoop stores intermediate <key, value> pairs;
> we could not afford it as the data volume increases if it is kept in memory.

Hadoop is definitely prepared for very large numbers of intermediate
key/value pairs - that's pretty much the normal case for Hadoop jobs. It'll
stream to/from disk as necessary.

Take a look at combiners as well - they may buy you something.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
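
A minimal sketch of the kind of throwaway diff job being discussed might look
something like the following. This is just an illustration, not code from the
thread: it uses the old org.apache.hadoop.mapred API, it assumes the primary
key is the first tab-separated field of each text record (change that to match
the real layout), and all class names are made up.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class TableDiff {

      // Mapper: emit (primary key, whole record). Assumes the first
      // tab-separated field of each line is the primary key.
      public static class DiffMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          String record = line.toString();
          int tab = record.indexOf('\t');
          if (tab < 0) return;                      // skip malformed lines
          out.collect(new Text(record.substring(0, tab)), new Text(record));
        }
      }

      // Reducer: a key that arrives with exactly two identical records is
      // unchanged between the two days; anything else (one record, or two
      // differing records) is reported as a difference.
      public static class DiffReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          Set<String> distinct = new HashSet<String>();
          int count = 0;
          while (values.hasNext()) {
            distinct.add(values.next().toString());
            count++;
          }
          if (count != 2 || distinct.size() != 1) {
            // Emits the differing record(s) as a bracketed set, e.g. "[row]".
            out.collect(key, new Text(distinct.toString()));
          }
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(TableDiff.class);
        conf.setJobName("table-diff");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(DiffMapper.class);
        conf.setReducerClass(DiffReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        // args[0], args[1]: the day1 and day2 directories; args[2]: output.
        FileInputFormat.setInputPaths(conf, new Path(args[0]), new Path(args[1]));
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));
        JobClient.runJob(conf);
      }
    }

Run it with the two days' directories as the first two arguments and an output
directory as the third; the paths and jar name would be whatever fits your
setup. As written it only reports which keys changed; to tell additions from
deletions you would also tag each record with the input it came from (the old
API exposes the map task's input file via the "map.input.file" job property,
if I remember right).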
