It's possible to do the whole thing in one round of map/reduce. The only requirement is to be able to differentiate between the two types of input files, for example by using different file name extensions.
One of my coworkers wrote a smart InputFormat class that creates a different RecordReader for each file type, based on the input file's extension. In each RecordReader, you create a specially typed value object for that input. So, in your map method, you collect different value objects from different RecordReaders. In your reduce method, for each key, you do the necessary processing on the collection based on the value object types. The main point here is to keep track of the differences from the beginning to the end, and process them accordingly.

Nathan

-----Original Message-----
From: Colin Freas [mailto:[EMAIL PROTECTED]
Sent: Monday, March 24, 2008 1:36 PM
To: core-user@hadoop.apache.org
Subject: MapReduce with related data from disparate files

I have a cluster of 5 machines up and accepting jobs, and I'm trying to work out how to design my first MapReduce task for the data I have. So, I wonder if anyone has any experience with the sort of problem I'm trying to solve, and what the best ways to use Hadoop and MapReduce for it are.

I have two sets of related comma-delimited files. One is a set of unique records with something like a primary key in one of the fields. The other is a set of records keyed to the first by that primary key. It's basically a request log, plus ancillary data about each request. Something like:

file 1:
asdf, 1, 2, 5, 3 ...
qwer, 3, 6, 2, 7 ...
zxcv, 2, 3, 6, 4 ...

file 2:
asdf, 10, 3
asdf, 3, 2
asdf, 1, 3
zxcv, 3, 1

I basically need to flatten this mapping, and then perform some analysis on the result. I wrote a processing program that runs on a single machine to create a map like this:

file 1-2:
asdf, 1, 2, 5, 3, ..., 10, 3, 3, 2, 1, 3
qwer, 3, 6, 2, 7, ..., , , , , ,
zxcv, 2, 3, 6, 4, ..., , , 3, 1, ,

...where the "flattening" puts in blank values for missing ancillary data. I then sample this map, taking some small number of entire records for output, and extrapolate some statistics from the results.
So, what I'd really like to do is figure out exactly what questions I need to ask, and then, instead of sampling, do a full enumeration. Is my best bet to create the conflated data file (the one I labeled "file 1-2" above) in one job, and then do the analysis in another? Or is it better to do the conflation and aggregation in one step, and then combine those? I'm not sure how clear this is, but I believe it gets the gist across. Any thoughts appreciated, any questions answered.

-Colin
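For readers following the thread: the tagged-value join Nathan describes (a "reduce-side join") can be sketched in plain Python to show the logic. This is only an illustration under assumed names; a real job would use Hadoop's Java API with a custom InputFormat whose RecordReaders tag values by file extension, and the file names and field widths here are made up.

```python
from collections import defaultdict

def map_phase(files):
    """Simulate the map phase: tag each record with its source type
    (as a per-extension RecordReader would) and key it by the first field."""
    tagged = defaultdict(list)
    for name, lines in files.items():
        # Hypothetical convention: ".main" files hold the primary records.
        tag = "primary" if name.endswith(".main") else "ancillary"
        for line in lines:
            fields = [f.strip() for f in line.split(",")]
            tagged[fields[0]].append((tag, fields[1:]))
    return tagged

def reduce_phase(tagged, ancillary_width):
    """Simulate the reduce phase: per key, emit the primary record followed
    by ancillary values, padding with blanks for missing ancillary data
    (Colin's "flattening")."""
    out = {}
    for key, records in tagged.items():
        primary = next(v for t, v in records if t == "primary")
        flat = [x for t, v in records if t == "ancillary" for x in v]
        flat += [""] * (ancillary_width - len(flat))
        out[key] = primary + flat
    return out

files = {
    "requests.main": ["asdf, 1, 2, 5, 3", "qwer, 3, 6, 2, 7"],
    "requests.aux":  ["asdf, 10, 3", "asdf, 3, 2"],
}
joined = reduce_phase(map_phase(files), ancillary_width=4)
```

Because the primary key is the map output key, Hadoop's shuffle delivers all records for one key (primary plus ancillary) to the same reduce call, which is what makes the single-pass join possible.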