Thanks Aaron. The first option sounds good. How can I write the partition numbers to a single file while each partition is being written to its own file? I mean, OK, after the custom partitioner an identity reducer would work to write the part-xxxxx file for each partition, but how do I get all the reducers to write one single file containing their partition numbers? Should I do it manually?

One possibility: write out all the partition numbers (one per line) to a single file, then use the NLineInputFormat to make each line its own map task. Then in your mapper itself, you will get in a key of "0" or "1" or "2" etc. Then explicitly open /dataset1/part-(n) and /dataset2/part-(n) in your mapper.
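A rough sketch of the bookkeeping behind that suggestion, in plain Java with no Hadoop dependency: it generates the one-number-per-line index and turns one line of that index back into the pair of part file paths. The class and method names (PartitionPairs, indexLines, pathsFor) are made up for illustration, and the five-digit part-NNNNN naming is an assumption based on the part-xxxxx pattern mentioned above; jobs written against the newer API name reducer output part-r-NNNNN instead. As far as I know, NLineInputFormat hands the line text to the mapper as the value (the key being the byte offset), so the partition number would be parsed from the value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hadoop: pairs up the part files of two
// datasets that were written with the same partitioner.
class PartitionPairs {

    // Assumed naming: part-00000, part-00001, ... (five digits, zero padded).
    static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    // Contents of the index file: one partition number per line.
    static List<String> indexLines(int numPartitions) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            lines.add(Integer.toString(i));
        }
        return lines;
    }

    // Given one line of the index file, return the matching part file
    // from each dataset directory: [path in dir1, path in dir2].
    static String[] pathsFor(String line, String dir1, String dir2) {
        int partition = Integer.parseInt(line.trim());
        String name = partFileName(partition);
        return new String[] { dir1 + "/" + name, dir2 + "/" + name };
    }

    public static void main(String[] args) {
        // Three partitions, using the directory names from the thread.
        for (String line : indexLines(3)) {
            String[] paths = pathsFor(line, "/dataset1", "/dataset2");
            System.out.println(line + " -> " + paths[0] + " and " + paths[1]);
        }
    }
}
```

Since the number of partitions is known from the job configuration, the index file can simply be written by the driver before the second job starts, so the reducers do not need to coordinate on a shared file. Inside a real mapper, the two paths would then be opened through Hadoop's FileSystem API rather than just printed.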
If you wanted to be more clever, it might be possible to subclass MultiFileInputFormat to group together both datasets "file-number-wise" when generating splits, but I don't have specific guidance here.

- Aaron

On Sat, Jul 3, 2010 at 9:35 AM, abc xyz <[email protected]> wrote:

> Hello everyone,
>
> I have written my custom partitioner for partitioning datasets. I want to
> partition two datasets using the same partitioner and then, in the next
> mapreduce job, I want each mapper to handle the same partition from the
> two sources and perform some function such as joining etc. How can I
> ensure that one mapper gets the split that corresponds to the same
> partition from both sources?
>
> Any help would be highly appreciated.

________________________________
From: Aaron Kimball <[email protected]>
To: [email protected]
Sent: Mon, July 5, 2010 8:51:44 AM
Subject: Re: Partitioned Datasets Map/Reduce
