You could look at CombineFileInputFormat to generate a single split out of
several files.
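
To illustrate the general idea of packing many small files into fewer splits, here is a minimal self-contained sketch. It is not Hadoop code and not the algorithm CombineFileInputFormat uses (which also takes node and rack locality into account). The class name SplitPackingSketch, the pack method, and the sizes are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not part of Hadoop.
final class SplitPackingSketch {

    // Groups file sizes (in bytes) into splits, in input order. A new split
    // is started whenever adding the next file would push the current split
    // past maxSplitBytes. A single file larger than the limit gets its own split.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up file sizes; each inner list would be handled by one mapper.
        long[] sizes = {40, 30, 50, 20, 90};
        System.out.println(pack(sizes, 100));
    }
}
```

With fewer, larger splits, a map-only job would launch fewer mappers and so write fewer output files.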

Your partitioner would be able to assign keys to specific reducers, but you
would not have control over which node a given reduce task runs on.
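
As a rough sketch of what a partitioner decides, the snippet below reproduces the arithmetic of Hadoop's default HashPartitioner in plain Java, with no Hadoop dependency. The class name PartitionSketch and the sample keys are hypothetical; in a real job this logic would live in the getPartition method of your Partitioner.

```java
// Hypothetical standalone illustration; a real job would implement
// Hadoop's Partitioner and put this logic in its getPartition method.
final class PartitionSketch {

    // Maps a key to a reduce task index in [0, numReduceTasks).
    // Masking with Integer.MAX_VALUE clears the sign bit so that a
    // negative hashCode cannot produce a negative index.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"a.xml", "b.xml", "c.xml"};
        for (String key : keys) {
            System.out.println(key + " -> reduce task " + getPartition(key, 4));
        }
    }
}
```

The return value only names a reduce task by its index. Which node that task is scheduled on is decided by the framework, which is why a partitioner alone cannot keep map output on the same node.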

Jothi


On 6/18/09 5:10 AM, "Tarandeep Singh" <tarand...@gmail.com> wrote:

> Hi,
> 
> Can I restrict the output of mappers running on a node to go to reducer(s)
> running on the same node?
> 
> Let me explain why I want to do this-
> 
> I am converting a huge number of XML files into SequenceFiles. So
> theoretically I don't even need reducers; mappers would read XML files and
> output SequenceFiles. But the problem with this approach is that I will end up
> with a huge number of small output files.
> 
> To avoid generating a large number of small files, I can use Identity reducers.
> But by running reducers, I am unnecessarily transferring data over the network. I
> ran a test using a small subset of my data (~90GB). With a map-only
> job, my cluster finished the conversion in only 6 minutes. But with a map and
> Identity reducer job, it takes around 38 minutes.
> 
> I have to process close to a terabyte of data, so I was thinking of faster
> alternatives:
> 
> * Writing a custom OutputFormat
> * Somehow restrict the output of mappers running on a node to go to reducers
> running on the same node. Maybe I can write my own partitioner (simple), but I am
> not sure how Hadoop's framework assigns partitions to reduce tasks.
> 
> Any pointers?
> 
> Or is this not possible at all?
> 
> Thanks,
> Tarandeep
