Map-reduce excels at gluing together files like this.

The map phase selects the join key for each record and tags the record so
that you can tell what its source is.
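
A minimal sketch of such a mapper, written against the old
org.apache.hadoop.mapred API.  The JoinMapper name, the "users" substring
check on the path, and the USER/LOG tags are placeholders for your own
layout; map.input.file is the property that normally carries the name of the
file the current split came from.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class JoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String sourceTag;

  // Called once when the map task starts; map.input.file holds the
  // path of the file this task's split came from.
  public void configure(JobConf job) {
    String path = job.get("map.input.file", "");
    sourceTag = path.contains("users") ? "USER" : "LOG";
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] fields = line.toString().split(",\\s*");
    // First field is the join key; tag the whole record so the
    // reducer can tell which source it came from.
    out.collect(new Text(fields[0]),
                new Text(sourceTag + "\t" + line.toString()));
  }
}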

The reduce phase takes all of the records with the same key and glues them
together.  It can do your processing, but it is also common to need a
subsequent aggregation step.  One typical log processing scenario is joining
access logs to a user database and accumulating aggregate statistics grouped
by some characteristic(s) from the user database.  That grouping would take
an additional aggregation pass after the join.
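
A matching reducer sketch, again only an outline: it separates the incoming
values by the tag the mapper attached, then emits one joined line per log
record, leaving the user part blank when there is no match.  Per-group
statistics could be computed here instead of, or in addition to, emitting the
joined records.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class JoinReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String userRecord = null;
    List<String> logRecords = new ArrayList<String>();

    // Separate the two sources by the tag the mapper attached.
    while (values.hasNext()) {
      String v = values.next().toString();
      if (v.startsWith("USER\t")) {
        userRecord = v.substring(5);
      } else {
        logRecords.add(v.substring(4));   // strip "LOG\t"
      }
    }

    // Glue each log record to its user record; a missing user
    // record simply leaves that part blank.
    for (String log : logRecords) {
      out.collect(key, new Text(log + "\t"
          + (userRecord == null ? "" : userRecord)));
    }
  }
}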

A few helpful hints.  Store the name of the input file using the config
method of your mapper; that method is called when each map task starts, so it
is a good place to record which file the task's records come from (as in the
mapper sketch above).  Carry that file name, or a tag derived from it, into
the values seen by the reducer so you don't have to guess which records are
which.  Also watch out for reduce calls that don't have all of the kinds of
data you want: keys in your logs with no match in the ancillary data, or
ancillary records that are never referenced in your logs.  Either way, you
have to handle those cases deliberately.
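
For completeness, a driver along these lines would feed both data sets to the
same job.  The class and path names are made up, and depending on your Hadoop
version the input and output paths may be set directly on the JobConf rather
than through FileInputFormat/FileOutputFormat.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LogUserJoin {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LogUserJoin.class);
    conf.setJobName("log-user-join");

    conf.setMapperClass(JoinMapper.class);
    conf.setReducerClass(JoinReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Both data sets go into the same job; the mapper tells them
    // apart by the path of the file it is reading.
    FileInputFormat.addInputPath(conf, new Path("logs"));
    FileInputFormat.addInputPath(conf, new Path("users"));
    FileOutputFormat.setOutputPath(conf, new Path("joined"));

    JobClient.runJob(conf);
  }
}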


On 3/24/08 1:35 PM, "Colin Freas" <[EMAIL PROTECTED]> wrote:

> I have a cluster of 5 machines up and accepting jobs, and I'm trying to work
> out how to design my first MapReduce task for the data I have.
> 
> So, I wonder if anyone has any experience with the sort of problem I'm
> trying to solve, and what the best ways to use Hadoop and MapReduce for it
> are.
> 
> I have two sets of related comma delimited files.  One is a set of unique
> records with something like a primary key in one of the fields.  The other
> is a set of records keyed to the first with the primary key.  It's basically
> a request log, and ancillary data about the request.  Something like:
> 
> file 1:
> asdf, 1, 2, 5, 3 ...
> qwer, 3, 6, 2, 7 ...
> zxcv, 2, 3, 6, 4 ...
> 
> file 2:
> asdf, 10, 3
> asdf, 3, 2
> asdf, 1, 3
> zxcv, 3, 1
> 
> I basically need to flatten this mapping, and then perform some analysis on
> the result.  I wrote a processing program that runs on a single machine to
> create a map like this:
> 
> file 1-2:
> asdf, 1, 2, 5, 3, ... 10, 3, 3, 2, 1, 3
> qwer, 3, 6, 2, 7, ... , , , , ,
> zxcv, 2, 3, 6, 4, ... , , 3, 1, ,
> 
> ... where the "flattening" puts in blank values for missing ancillary data.
> I then sample this map taking some small number of entire records for
> output, and extrapolate some statistics from the results.
> 
> So, what I'd really like to do is figure out exactly what questions I need
> to ask, and instead of sampling, do an enumeration.
> 
> Is my best bet to create the conflated data file (that I labeled "file 1-2"
> above) in one task, then do analysis using another?  Or is it better to do
> the conflation and aggregation in one step, and then combine those?
> 
> I'm not sure how clear this is, but I believe it gets the gist across.
> 
> Any thoughts appreciated, any questions answered.
> 
> 
> 
> -Colin
