You could look at CombineFileInputFormat to generate a single split out of several files.
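To make that concrete: in a map-only job, CombineFileInputFormat packs many small input files into one split, so each mapper handles many XML files and emits one larger SequenceFile instead of many tiny ones. A hypothetical driver sketch follows; `XmlToSeqMapper` and the paths are placeholders, and `CombineTextInputFormat` is a concrete subclass that only ships in later Hadoop releases (in the 0.19/0.20-era API you would subclass the abstract CombineFileInputFormat and supply a RecordReader yourself). This is a configuration fragment, not a runnable program:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Sketch of a map-only conversion job. XmlToSeqMapper is hypothetical:
// it would parse one XML record and emit it as a SequenceFile key/value.
Job job = Job.getInstance();
job.setInputFormatClass(CombineTextInputFormat.class);
// Cap each combined split at ~256 MB so one mapper consumes many small files
// and therefore writes one reasonably large output file.
CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
job.setMapperClass(XmlToSeqMapper.class);
job.setNumReduceTasks(0);                        // map-only: no shuffle at all
job.setOutputFormatClass(SequenceFileOutputFormat.class);
FileInputFormat.addInputPath(job, new Path("/input/xml"));
FileOutputFormat.setOutputPath(job, new Path("/output/seq"));
```

With the split size capped this way, the number of output files is roughly (total input size / max split size) rather than one file per input XML file.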
Your partitioner would be able to assign keys to specific reducers, but you would not have control over which node a given reduce task runs on.

Jothi

On 6/18/09 5:10 AM, "Tarandeep Singh" <tarand...@gmail.com> wrote:
> Hi,
>
> Can I restrict the output of mappers running on a node to go to reducer(s)
> running on the same node?
>
> Let me explain why I want to do this.
>
> I am converting a huge number of XML files into SequenceFiles. So
> theoretically I don't even need reducers; mappers would read the XML files
> and output SequenceFiles. But the problem with this approach is that I will
> end up with a huge number of small output files.
>
> To avoid generating a large number of small files, I can run Identity
> reducers. But by running reducers, I am unnecessarily transferring data over
> the network. I ran a test using a small subset of my data (~90GB). With a
> map-only job, my cluster finished the conversion in only 6 minutes, but with
> a map plus Identity-reducer job, it takes around 38 minutes.
>
> I have to process close to a terabyte of data, so I was thinking of faster
> alternatives:
>
> * Writing a custom OutputFormat
> * Somehow restricting the output of mappers running on a node to reducers
> running on the same node. Maybe I can write my own partitioner (simple), but
> I am not sure how Hadoop's framework assigns partitions to reduce tasks.
>
> Any pointers?
>
> Or is this not possible at all?
>
> Thanks,
> Tarandeep
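On the question of how partitions are assigned: by default Hadoop uses HashPartitioner, which maps each key to partition `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, and reduce task i then fetches partition i from every map's output. So a custom partitioner only chooses which reduce *task* receives a key, not which *node* that task is scheduled on, which is why the caveat above holds. A minimal plain-Java sketch of the same arithmetic (class and method names here are illustrative, not Hadoop API):

```java
// Mirrors the arithmetic of Hadoop's default HashPartitioner.
public class HashPartitionDemo {

    // HashPartitioner.getPartition computes:
    //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    // The mask clears the sign bit so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition; with a
        // single reducer everything goes to partition 0.
        System.out.println(partitionFor("doc-0001.xml", 8));
        System.out.println(partitionFor("doc-0001.xml", 1));
    }
}
```

Because the mapping is purely a function of the key, nothing in it knows about nodes; node placement of reduce tasks is the scheduler's decision.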