Re: parititioning dataset

Denim Live Tue, 06 Jul 2010 02:38:50 -0700

Hi, 
Yes, it makes sense to do the join on the reduce side, but I want it the other
way round, with the join done in the mappers. One option is something like this,
which someone from Cloudera suggested: "write out all the partition numbers (one per line) to a
single file, then use the NLineInputFormat to make each line its own map
task. Then in your mapper itself, you will get in a key of "0" or "1" or "2"
etc. Then explicitly open /dataset1/part-(n) and /dataset2/part-(n) in your
mapper."

That is one option. Any other suggestions are welcome.
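
For what it's worth, here is a minimal sketch of the per-partition work such a mapper would do once it knows its partition number n. It is plain Java on local files, not Hadoop code: the class name, the `part-%05d` file naming, and the tab-separated "key<TAB>value" record layout are all assumptions made for illustration. In a real job the files would be opened through the Hadoop FileSystem API instead of java.nio.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: joins partition n of dataset1 with partition n of dataset2.
final class PartitionJoinSketch {

    // Assumed naming scheme (part-00000, part-00001, ...); adjust to the real output names.
    static String partName(int n) {
        return String.format(Locale.ROOT, "part-%05d", n);
    }

    // Assumes both datasets were written with the same partitioner, so any key
    // found in dataset1's partition n can only match keys in dataset2's partition n.
    static List<String> joinPartition(Path dataset1, Path dataset2, int n) throws IOException {
        // Load partition n of dataset1 into memory, grouped by key.
        Map<String, List<String>> left = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(dataset1.resolve(partName(n)))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv.length < 2) continue; // skip malformed lines
                left.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }

        // Stream partition n of dataset2 and emit one row per matching pair.
        List<String> joined = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(dataset2.resolve(partName(n)))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] kv = line.split("\t", 2);
                if (kv.length < 2) continue;
                for (String l : left.getOrDefault(kv[0], Collections.<String>emptyList())) {
                    joined.add(kv[0] + "\t" + l + "\t" + kv[1]);
                }
            }
        }
        return joined;
    }
}
```

This only works if one partition of dataset1 fits in memory; if both partitions are sorted by key, a streaming merge would avoid that limit.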

________________________________
From: Alex Loddengaard <[email protected]>
To: [email protected]
Sent: Mon, July 5, 2010 7:16:02 PM
Subject: Re: parititioning dataset

Hi there, 

Unfortunately you can't control which mapper gets what data.  The InputSplit -> 
map task assignment is random.  You could, however, do the join in the reduce, 
by using an intermediate key as your join key.
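
To make the idea concrete, here is a rough sketch of a reduce-side join simulated in plain Java, with no Hadoop classes involved. The class name, the "L:"/"R:" source tags, and the in-memory TreeMap standing in for the shuffle are assumptions for illustration only; in a real job the map and reduce steps would live in Mapper and Reducer classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of a reduce-side join on an intermediate join key.
final class ReduceSideJoinSketch {

    // "Map" step: emit (joinKey, tag + value); the tag records the source dataset.
    static void map(String tag, List<String[]> records, Map<String, List<String>> shuffle) {
        for (String[] kv : records) {
            shuffle.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(tag + kv[1]);
        }
    }

    // "Reduce" step: for one join key, split the values by tag and pair them up.
    static List<String> reduce(String key, List<String> tagged) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String t : tagged) {
            (t.startsWith("L:") ? left : right).add(t.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(key + "\t" + l + "\t" + r);
            }
        }
        return out;
    }

    // Runs both steps; the TreeMap groups values by key, as the shuffle would.
    static List<String> join(List<String[]> dataset1, List<String[]> dataset2) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        map("L:", dataset1, shuffle);
        map("R:", dataset2, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }
}
```

Because every record with the same join key reaches the same reduce call, it does not matter which map task read which split.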

Does that make sense?

Alex

On Sat, Jul 3, 2010 at 9:28 AM, Denim Live <[email protected]> wrote:

>Hello everyone,
>
>I have written my custom partitioner for partitioning datasets. I want to 
>partition two datasets using the same partitioner and then in the next 
>mapreduce job, I want each mapper to handle the same partition from the two 
>sources and perform some operation such as a join. How can I ensure that 
>one mapper gets the split that corresponds to the same partition from both 
>sources? 
>
>Any help would be highly appreciated.
>Alex
