Yes, you are correct. I had not thought about sharing a file handle across multiple tasks via JVM reuse.
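To make that concrete, the shared-writer pattern under discussion might look
roughly like the sketch below (against the 0.20 mapred API, with
mapred.job.reuse.jvm.num.tasks set to -1; the class name and output path are
made up, and this is untested):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SharedWriterMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

      // One writer per JVM. With mapred.job.reuse.jvm.num.tasks=-1,
      // every map task scheduled into this JVM appends to the same file.
      private static SequenceFile.Writer writer;

      @Override
      public void configure(JobConf job) {
        synchronized (SharedWriterMapper.class) {
          if (writer != null) {
            return;
          }
          try {
            FileSystem fs = FileSystem.get(job);
            // Hypothetical path; a real job needs a unique,
            // collision-free name per JVM.
            Path out = new Path("/tmp/combined-" + System.nanoTime() + ".seq");
            writer = SequenceFile.createWriter(fs, job, out,
                Text.class, Text.class);
            // A shutdown hook is the only reliable place to close the
            // writer, since no single task knows it is the last one.
            Runtime.getRuntime().addShutdownHook(new Thread() {
              public void run() {
                try {
                  writer.close();
                } catch (IOException e) {
                  // best effort; the JVM is exiting anyway
                }
              }
            });
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
        }
      }

      public void map(Text key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Bypass output.collect() and write straight to the shared file.
        writer.append(key, value);
      }
    }

Dropping the static keyword and closing the writer in close() instead gives
the one-file-per-map-task variant suggested further down the thread. Either
way this bypasses the OutputCommitter, so failed or speculatively executed
tasks leave partial files behind that nothing cleans up.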
On Thu, Jun 18, 2009 at 9:43 AM, Tarandeep Singh <tarand...@gmail.com> wrote:

> Jason, correct me if I am wrong -
>
> Opening a SequenceFile in the configure (or setup method in 0.20) and
> writing to it is the same as doing output.collect(), unless you mean I
> should make the sequence file writer a static variable and set the JVM
> reuse flag to -1. In that case the subsequent mappers might run in the
> same JVM and they can use the same writer and hence produce one file.
> But in that case I need to add a hook to close the writer - maybe use
> a shutdown hook.
>
> Jothi, the idea of a combine input format is good, but I guess I have
> to write something of my own to make it work in my case.
>
> Thanks guys for the suggestions... but I feel we should have some
> support from the framework to merge the output of a map-only job so
> that we don't get a large number of smaller files. Sometimes you just
> don't want to run reducers and unnecessarily transfer a whole lot of
> data across the network.
>
> Thanks,
> Tarandeep
>
> On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop <jason.had...@gmail.com> wrote:
>
> > You can open your sequence file in the mapper configure method, write
> > to it in your map, and close it in the mapper close method. Then you
> > end up with one sequence file per map. I am making the assumption
> > that each key/value pair given to your map somehow represents a
> > single xml file/item.
> >
> > On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan <joth...@yahoo-inc.com> wrote:
> >
> > > You could look at CombineFileInputFormat to generate a single split
> > > out of several files.
> > >
> > > Your partitioner would be able to assign keys to specific reducers,
> > > but you would not have control over which node a given reduce task
> > > will run on.
> > >
> > > Jothi
> > >
> > > On 6/18/09 5:10 AM, "Tarandeep Singh" <tarand...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Can I restrict the output of mappers running on a node to go to
> > > > reducer(s) running on the same node?
> > > >
> > > > Let me explain why I want to do this -
> > > >
> > > > I am converting a huge number of XML files into SequenceFiles. So
> > > > theoretically I don't even need reducers; mappers would read the
> > > > xml files and output SequenceFiles. But the problem with this
> > > > approach is that I will end up with a huge number of small output
> > > > files.
> > > >
> > > > To avoid generating a large number of smaller files, I can run
> > > > Identity reducers. But by running reducers, I am unnecessarily
> > > > transferring data over the network. I ran some test cases using a
> > > > small subset of my data (~90GB). With a map-only job, my cluster
> > > > finished the conversion in only 6 minutes, but with a map plus
> > > > Identity-reducer job it takes around 38 minutes.
> > > >
> > > > I have to process close to a terabyte of data, so I was thinking
> > > > of faster alternatives -
> > > >
> > > > * Writing a custom OutputFormat
> > > > * Somehow restricting the output of mappers running on a node to
> > > > go to reducers running on the same node. Maybe I can write my own
> > > > partitioner (simple), but I am not sure how Hadoop's framework
> > > > assigns partitions to reduce tasks.
> > > >
> > > > Any pointers?
> > > >
> > > > Or is this not possible at all?
> > > >
> > > > Thanks,
> > > > Tarandeep

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
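Regarding Jothi's CombineFileInputFormat pointer in the quoted thread, a
rough sketch of how one might wire it up with the 0.20 mapred API follows.
The subclass name, the 128 MB cap, and the line-oriented per-file reader are
illustrative only (an XML job would substitute its own record reader), and
this is untested:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
    import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
    import org.apache.hadoop.mapred.lib.CombineFileSplit;

    public class CombinedTextInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

      public CombinedTextInputFormat() {
        // Cap each combined split so that many small files are packed
        // into a single map task rather than one task per file.
        setMaxSplitSize(128 * 1024 * 1024);
      }

      @Override
      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter)
          throws IOException {
        // CombineFileRecordReader opens one SingleFileLineReader per
        // file packed into the combined split.
        @SuppressWarnings("unchecked")
        Class<RecordReader<LongWritable, Text>> readerClass =
            (Class<RecordReader<LongWritable, Text>>) (Class<?>)
                SingleFileLineReader.class;
        return new CombineFileRecordReader<LongWritable, Text>(
            job, (CombineFileSplit) split, reporter, readerClass);
      }

      // Per-file reader. CombineFileRecordReader requires exactly this
      // constructor: (CombineFileSplit, Configuration, Reporter, Integer).
      public static class SingleFileLineReader
          implements RecordReader<LongWritable, Text> {

        private final LineRecordReader delegate;

        public SingleFileLineReader(CombineFileSplit split,
            Configuration conf, Reporter reporter, Integer index)
            throws IOException {
          // Delegate to a plain LineRecordReader over the index-th file.
          delegate = new LineRecordReader(conf, new FileSplit(
              split.getPath(index), split.getOffset(index),
              split.getLength(index), (String[]) null));
        }

        public boolean next(LongWritable key, Text value)
            throws IOException {
          return delegate.next(key, value);
        }
        public LongWritable createKey() { return delegate.createKey(); }
        public Text createValue() { return delegate.createValue(); }
        public long getPos() throws IOException { return delegate.getPos(); }
        public float getProgress() throws IOException {
          return delegate.getProgress();
        }
        public void close() throws IOException { delegate.close(); }
      }
    }

The driver would then just call
job.setInputFormat(CombinedTextInputFormat.class). Since each map now sees
many files' worth of records, a map-only job produces far fewer output
files without any shuffle at all.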