Yes, you are correct. I had not thought about sharing a file handle across
multiple tasks via JVM reuse.


On Thu, Jun 18, 2009 at 9:43 AM, Tarandeep Singh <tarand...@gmail.com> wrote:

> Jason, correct me if I am wrong-
>
> Opening the Sequence file in the configure method (or setup, in 0.20) and
> writing to it is the same as doing output.collect(), unless you mean I
> should make the sequence file writer a static variable and set the reuse
> JVM flag to -1. In that case subsequent mappers might run in the same JVM
> and could share the same writer, and hence produce one file. But in that
> case I need to add a hook to close the writer - maybe a shutdown hook.
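>
> For what it's worth, a rough sketch of that static-writer idea (untested;
> it assumes JVM reuse is enabled via JobConf.setNumTasksToExecutePerJvm(-1),
> i.e. mapred.job.reuse.jvm.num.tasks = -1, and the output path and
> key/value types are placeholders):
>
>   import java.io.IOException;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.SequenceFile;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapred.JobConf;
>
>   public class SharedWriter {
>     private static SequenceFile.Writer writer;
>
>     // Called from the mapper's configure(); lazily opens one writer
>     // per JVM (named after the first task to open it) and registers a
>     // shutdown hook to close it when the reused JVM finally exits.
>     public static synchronized SequenceFile.Writer get(JobConf conf)
>         throws IOException {
>       if (writer == null) {
>         FileSystem fs = FileSystem.get(conf);
>         Path out = new Path("/output/" + conf.get("mapred.task.id"));
>         writer = SequenceFile.createWriter(fs, conf, out,
>                                            Text.class, Text.class);
>         Runtime.getRuntime().addShutdownHook(new Thread() {
>           public void run() {
>             try { writer.close(); } catch (IOException e) { /* log */ }
>           }
>         });
>       }
>       return writer;
>     }
>   }
>
> One caveat with the shutdown-hook approach: HDFS registers its own
> shutdown hook to close FileSystem instances, and the two hooks run in no
> guaranteed order, so the close() may fail if the FileSystem goes first.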
>
> Jothi, the idea of a combine input format is good, but I guess I have to
> write something of my own to make it work in my case.
>
> Thanks guys for the suggestions... but I feel we should have some support
> from the framework to merge the output of map-only jobs, so that we don't
> end up with a large number of small files. Sometimes you just don't want
> to run reducers and unnecessarily transfer a whole lot of data across the
> network.
>
> Thanks,
> Tarandeep
>
> On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop <jason.had...@gmail.com>
> wrote:
>
> > You can open your sequence file in the mapper's configure method, write
> > to it in your map, and close it in the mapper's close method. Then you
> > end up with one sequence file per map. I am making the assumption that
> > each key/value passed to your map somehow represents a single xml
> > file/item.
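> >
> > A minimal sketch of that pattern (untested, old mapred API; it assumes
> > the job's InputFormat delivers each xml document as one Text value,
> > and the output path here is a placeholder):
> >
> >   import java.io.IOException;
> >   import org.apache.hadoop.fs.FileSystem;
> >   import org.apache.hadoop.fs.Path;
> >   import org.apache.hadoop.io.LongWritable;
> >   import org.apache.hadoop.io.NullWritable;
> >   import org.apache.hadoop.io.SequenceFile;
> >   import org.apache.hadoop.io.Text;
> >   import org.apache.hadoop.mapred.JobConf;
> >   import org.apache.hadoop.mapred.MapReduceBase;
> >   import org.apache.hadoop.mapred.Mapper;
> >   import org.apache.hadoop.mapred.OutputCollector;
> >   import org.apache.hadoop.mapred.Reporter;
> >
> >   public class XmlToSeqMapper extends MapReduceBase
> >       implements Mapper<LongWritable, Text, NullWritable, NullWritable> {
> >     private SequenceFile.Writer writer;
> >
> >     public void configure(JobConf conf) {
> >       try {
> >         FileSystem fs = FileSystem.get(conf);
> >         // One output file per map task, named after the task id.
> >         Path out = new Path("/output/" + conf.get("mapred.task.id"));
> >         writer = SequenceFile.createWriter(fs, conf, out,
> >                                            Text.class, Text.class);
> >       } catch (IOException e) {
> >         throw new RuntimeException(e);
> >       }
> >     }
> >
> >     public void map(LongWritable key, Text xml,
> >                     OutputCollector<NullWritable, NullWritable> out,
> >                     Reporter reporter) throws IOException {
> >       // Write directly to our own sequence file instead of collect().
> >       writer.append(new Text(key.toString()), xml);
> >     }
> >
> >     public void close() throws IOException {
> >       writer.close();
> >     }
> >   }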
> >
> > On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan
> > <joth...@yahoo-inc.com> wrote:
> >
> > > You could look at CombineFileInputFormat to generate a single split
> > > out of several files.
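> > >
> > > A bare-bones sketch of wiring that up (untested, old mapred API;
> > > XmlRecordReader is a hypothetical per-file reader you would have to
> > > supply, with the (CombineFileSplit, Configuration, Reporter, Integer)
> > > constructor that CombineFileRecordReader expects):
> > >
> > >   import java.io.IOException;
> > >   import org.apache.hadoop.io.LongWritable;
> > >   import org.apache.hadoop.io.Text;
> > >   import org.apache.hadoop.mapred.InputSplit;
> > >   import org.apache.hadoop.mapred.JobConf;
> > >   import org.apache.hadoop.mapred.RecordReader;
> > >   import org.apache.hadoop.mapred.Reporter;
> > >   import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
> > >   import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
> > >   import org.apache.hadoop.mapred.lib.CombineFileSplit;
> > >
> > >   public class CombinedXmlInputFormat
> > >       extends CombineFileInputFormat<LongWritable, Text> {
> > >     public CombinedXmlInputFormat() {
> > >       setMaxSplitSize(134217728); // e.g. cap combined splits at 128MB
> > >     }
> > >
> > >     public RecordReader<LongWritable, Text> getRecordReader(
> > >         InputSplit split, JobConf conf, Reporter reporter)
> > >         throws IOException {
> > >       // Delegates to one XmlRecordReader per small file in the
> > >       // combined split (raw Class cast needed by the generic ctor).
> > >       return new CombineFileRecordReader<LongWritable, Text>(
> > >           conf, (CombineFileSplit) split, reporter,
> > >           (Class) XmlRecordReader.class);
> > >     }
> > >   }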
> > >
> > > Your partitioner would be able to assign keys to specific reducers,
> > > but you would not have control over which node a given reduce task
> > > will run on.
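> > >
> > > For reference, a minimal sketch of the old-API Partitioner contract
> > > (key/value types are placeholders) - it shows what you do control
> > > (the partition number) and what you don't (task placement):
> > >
> > >   import org.apache.hadoop.io.Text;
> > >   import org.apache.hadoop.mapred.JobConf;
> > >   import org.apache.hadoop.mapred.Partitioner;
> > >
> > >   public class SimplePartitioner implements Partitioner<Text, Text> {
> > >     public void configure(JobConf conf) { }
> > >
> > >     // You pick the reduce partition for each key; the framework
> > >     // alone decides which node runs each reduce task.
> > >     public int getPartition(Text key, Text value, int numPartitions) {
> > >       return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
> > >     }
> > >   }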
> > >
> > > Jothi
> > >
> > >
> > > On 6/18/09 5:10 AM, "Tarandeep Singh" <tarand...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > Can I restrict the output of mappers running on a node to go to
> > > > reducer(s) running on the same node?
> > > >
> > > > Let me explain why I want to do this-
> > > >
> > > > I am converting a huge number of XML files into SequenceFiles. So
> > > > theoretically I don't even need reducers; mappers would read xml
> > > > files and output SequenceFiles. But the problem with this approach
> > > > is that I will end up with a huge number of small output files.
> > > >
> > > > To avoid generating a large number of small files, I can run
> > > > Identity reducers. But by running reducers, I am unnecessarily
> > > > transferring data over the network. I ran a test using a small
> > > > subset of my data (~90GB). With a map-only job, my cluster finished
> > > > the conversion in only 6 minutes; with a map plus Identity-reducer
> > > > job, it takes around 38 minutes.
> > > >
> > > > I have to process close to a terabyte of data, so I was thinking
> > > > of faster alternatives-
> > > >
> > > > * Writing a custom OutputFormat (a skeleton is sketched below)
> > > > * Somehow restricting the output of mappers running on a node to go
> > > > to reducers running on the same node. Maybe I can write my own
> > > > partitioner (simple), but I am not sure how Hadoop's framework
> > > > assigns partitions to reduce tasks.
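> > > >
> > > > (For the first bullet, a skeleton of what a custom OutputFormat
> > > > looks like in the old mapred API - untested, and the SequenceFile
> > > > key/value types here are placeholders:)
> > > >
> > > >   import java.io.IOException;
> > > >   import org.apache.hadoop.fs.FileSystem;
> > > >   import org.apache.hadoop.fs.Path;
> > > >   import org.apache.hadoop.io.SequenceFile;
> > > >   import org.apache.hadoop.io.Text;
> > > >   import org.apache.hadoop.mapred.FileOutputFormat;
> > > >   import org.apache.hadoop.mapred.JobConf;
> > > >   import org.apache.hadoop.mapred.RecordWriter;
> > > >   import org.apache.hadoop.mapred.Reporter;
> > > >   import org.apache.hadoop.util.Progressable;
> > > >
> > > >   public class XmlSeqOutputFormat
> > > >       extends FileOutputFormat<Text, Text> {
> > > >     public RecordWriter<Text, Text> getRecordWriter(
> > > >         FileSystem fs, JobConf conf, String name,
> > > >         Progressable progress) throws IOException {
> > > >       Path file = FileOutputFormat.getTaskOutputPath(conf, name);
> > > >       final SequenceFile.Writer out = SequenceFile.createWriter(
> > > >           fs, conf, file, Text.class, Text.class);
> > > >       return new RecordWriter<Text, Text>() {
> > > >         public void write(Text key, Text value) throws IOException {
> > > >           out.append(key, value);
> > > >         }
> > > >         public void close(Reporter reporter) throws IOException {
> > > >           out.close();
> > > >         }
> > > >       };
> > > >     }
> > > >   }
> > > >
> > > > Note this is essentially what the built-in SequenceFileOutputFormat
> > > > already does, and it still yields one file per task - a custom
> > > > OutputFormat changes the format, not the file count.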
> > > >
> > > > Any pointers?
> > > >
> > > > Or is this not possible at all?
> > > >
> > > > Thanks,
> > > > Tarandeep
> > >
> > >
> >
> >
> > --
> > Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> > http://www.amazon.com/dp/1430219424?tag=jewlerymall
> > www.prohadoopbook.com a community for Hadoop Professionals
> >
>



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
