It's possible to do the whole thing in one round of map/reduce. The only requirement is to be able to differentiate between the two types of input files, possibly by using different file name extensions.

One of my coworkers wrote a smart InputFormat class that creates a
different RecordReader for each file type, based on the input file's
extension.  
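
Roughly, the shape of it is this (my sketch against the old mapred API, not his actual code; the ".log" extension and the two RecordReader classes are made-up placeholders, and TaggedValue is the custom value type described below):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Picks a RecordReader per split, based on the input file's extension.
public class ExtensionSwitchingInputFormat
        extends FileInputFormat<LongWritable, TaggedValue> {

    public RecordReader<LongWritable, TaggedValue> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        Path path = ((FileSplit) split).getPath();
        if (path.getName().endsWith(".log")) {
            // hypothetical reader for the primary (request log) records
            return new PrimaryRecordReader((FileSplit) split, job);
        } else {
            // hypothetical reader for the ancillary records
            return new AncillaryRecordReader((FileSplit) split, job);
        }
    }
}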

In each RecordReader, you create a special typed value object for that input. So, in your map method, you collect different value objects from different RecordReaders. In your reduce method, for each key, you do the necessary processing on the collection based on the value object types.
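
Concretely, the value type might look something like this (a minimal sketch; TaggedValue is an invented name, and a real version would carry properly typed fields rather than the raw line):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// One Writable that remembers which kind of file its record came from.
public class TaggedValue implements Writable {
    public byte tag;               // e.g. 0 = primary record, 1 = ancillary record
    public Text line = new Text(); // the raw record; real code would use typed fields

    public void write(DataOutput out) throws IOException {
        out.writeByte(tag);
        line.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        tag = in.readByte();
        line.readFields(in);
    }
}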

The main point here is to keep track of the differences from the
beginning to the end, and process them accordingly.
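
For example, the reduce side could bucket the values for each key by tag before joining them (again just a sketch against the old mapred API, assuming the TaggedValue class above):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class JoinReducer extends MapReduceBase
        implements Reducer<Text, TaggedValue, Text, Text> {

    public void reduce(Text key, Iterator<TaggedValue> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        Text primary = null;
        List<Text> ancillary = new ArrayList<Text>();

        // Separate the two record types for this key.
        while (values.hasNext()) {
            TaggedValue v = values.next();
            if (v.tag == 0) {
                primary = new Text(v.line);       // the unique record
            } else {
                ancillary.add(new Text(v.line));  // zero or more ancillary records
            }
        }

        // Emit the flattened record; real code would pad blank fields
        // when a key has no ancillary data.
        StringBuilder joined = new StringBuilder();
        if (primary != null) joined.append(primary.toString());
        for (Text a : ancillary) joined.append(", ").append(a.toString());
        output.collect(key, new Text(joined.toString()));
    }
}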

Nathan

-----Original Message-----
From: Colin Freas [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 24, 2008 1:36 PM
To: core-user@hadoop.apache.org
Subject: MapReduce with related data from disparate files

I have a cluster of 5 machines up and accepting jobs, and I'm trying to work out how to design my first MapReduce task for the data I have.

So, I wonder if anyone has experience with the sort of problem I'm trying to solve, and how best to use Hadoop and MapReduce for it.

I have two sets of related comma-delimited files.  One is a set of unique records with something like a primary key in one of the fields.  The other is a set of records keyed to the first set by that primary key.  It's basically a request log, plus ancillary data about each request.  Something like:

file 1:
asdf, 1, 2, 5, 3 ...
qwer, 3, 6, 2, 7 ...
zxcv, 2, 3, 6, 4 ...

file 2:
asdf, 10, 3
asdf, 3, 2
asdf, 1, 3
zxcv, 3, 1

I basically need to flatten this mapping, and then perform some analysis on the result.  I wrote a processing program that runs on a single machine to create a map like this:

file 1-2:
asdf, 1, 2, 5, 3, ... 10, 3, 3, 2, 1, 3
qwer, 3, 6, 2, 7, ... , , , , ,
zxcv, 2, 3, 6, 4, ... , , 3, 1, ,

... where the "flattening" puts in blank values for missing ancillary data.  I then sample this map, taking a small number of entire records for output, and extrapolate some statistics from the results.
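
(For concreteness, the single-machine version is essentially an in-memory hash join along these lines; this is an illustrative sketch, not the actual program:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Flatten {
    public static void main(String[] args) throws IOException {
        // Load file 2 (ancillary rows) into memory, grouped by the key field.
        Map<String, List<String>> extra = new HashMap<String, List<String>>();
        BufferedReader in = new BufferedReader(new FileReader(args[1]));
        String line;
        while ((line = in.readLine()) != null) {
            int comma = line.indexOf(',');
            String key = line.substring(0, comma).trim();
            List<String> rows = extra.get(key);
            if (rows == null) extra.put(key, rows = new ArrayList<String>());
            rows.add(line.substring(comma + 1).trim());
        }
        in.close();

        // Stream file 1 (the unique records) and append any matching
        // ancillary fields; real code would pad the right number of blanks.
        in = new BufferedReader(new FileReader(args[0]));
        while ((line = in.readLine()) != null) {
            String key = line.substring(0, line.indexOf(',')).trim();
            StringBuilder out = new StringBuilder(line);
            List<String> rows = extra.get(key);
            if (rows == null) {
                out.append(", , ");
            } else {
                for (String r : rows) out.append(", ").append(r);
            }
            System.out.println(out.toString());
        }
        in.close();
    }
}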

So, what I'd really like to do is figure out exactly what questions I need to ask, and then, instead of sampling, do a full enumeration.

Is my best bet to create the conflated data file (the one I labeled "file 1-2" above) in one task, and then do the analysis in another?  Or is it better to do the conflation and aggregation in one step, and then combine those?

I'm not sure how clear this is, but I believe it gets the gist across.

Any thoughts appreciated, any questions answered.



-Colin
