On Wed, Jul 29, 2009 at 1:37 AM, Devajyoti Sarkar <[email protected]> wrote:

> Thank you for a great tip - reusing the key/value objects after
> output.collect.
>
> I have one more question. Is the map output data stored on the local
> disk of the instance, or is it written out to HDFS? Specifically, if a
> single map outputs more data than the storage size of its local disk,
> does the job fail (or can one assume one has the full space of the disk
> available in HDFS)?
>

In a map-only job, the output is written directly to HDFS. In a job with
a reduce phase, the intermediate map output lands on the local disks of
the tasktracker running the task, so you'll need enough space in your
configured mapred.local.dir to hold it.
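For reference, mapred.local.dir is set in hadoop-site.xml (mapred-site.xml on 0.20+); it accepts a comma-separated list of directories, and the paths below are just illustrative:

```xml
<property>
  <name>mapred.local.dir</name>
  <!-- Comma-separated list: intermediate map output is spread across these
       local directories, so their combined free space bounds how much
       spill data a task can write. -->
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```

Pointing the entries at separate physical disks both adds capacity and spreads I/O.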

Todd


> On Wed, Jul 29, 2009 at 10:06 AM, Jason Venner <[email protected]>
> wrote:
>
> > In Hadoop 0.18 and beyond, the key and value do not have to implement
> > Writable.
> > As a general rule, the key and value objects passed to the map task
> > will be the same objects, with a fresh value initialized by the record
> > reader. The output.collect method serializes the value during the call
> > (unless you are using ChainMapper from 0.19+), so you are free to
> > reset the values stored in the key/value objects passed to
> > output.collect after the call.
> >
> > It is common practice to keep a class field holding an instance of the
> > output key or value type and use it for transformations, instead of
> > allocating a new key or value instance in each call to map or reduce.
> >
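The reuse pattern Jason describes can be sketched in plain Java. MutableText and Collector below are simplified stand-ins for Hadoop's Text and OutputCollector (not the real classes); the point is that collect() copies the value during the call, so one mutable instance can be reset and reused for every record:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a mutable Hadoop value type such as Text.
class MutableText {
    private String value = "";
    void set(String v) { value = v; }
    String get() { return value; }
}

// Stand-in for OutputCollector: "serializes" (here, copies) the value
// during the call, so the caller may safely mutate the object afterwards.
class Collector {
    final List<String> serialized = new ArrayList<>();
    void collect(MutableText val) { serialized.add(val.get()); }
}

public class ReusePattern {
    // Single reusable output value held as a class field, instead of
    // allocating a fresh instance in each call to map.
    final MutableText outValue = new MutableText();
    final Collector output = new Collector();

    void map(String record) {
        outValue.set(record.toUpperCase()); // transform into the reused object
        output.collect(outValue);           // copied here; safe to reset next call
    }

    public static void main(String[] args) {
        ReusePattern m = new ReusePattern();
        for (String rec : new String[] {"a", "b", "c"}) m.map(rec);
        System.out.println(m.output.serialized); // prints [A, B, C]
    }
}
```

Reuse did not clobber the earlier outputs because each collect() call took its own copy; this is what makes the single-field pattern safe.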
> > On Tue, Jul 28, 2009 at 11:29 AM, Devajyoti Sarkar <[email protected]>
> > wrote:
> >
> > > Thanks.
> > >
> > > Dev
> > >
> > > On Wed, Jul 29, 2009 at 2:27 AM, Todd Lipcon <[email protected]>
> > > wrote:
> > >
> > > > On Tue, Jul 28, 2009 at 11:24 AM, Devajyoti Sarkar
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > In the Hadoop documentation it says that all key-value classes
> > > > > need to implement Writable to allow serialization and
> > > > > deserialization of outputs between mappers and reducers. Is this
> > > > > also necessary for key/value pairs sent between the RecordReader
> > > > > and the Mapper (as well as the Reducer and the RecordWriter)? I
> > > > > assume that in each of these two cases, the classes are
> > > > > instantiated in the same VM. So is it safe to assume that
> > > > > key/value pairs are passed by reference instead of being
> > > > > serialized and deserialized? If so, my specific application may
> > > > > get a performance boost. Please do let me know if this is so.
> > > > >
> > > >
> > > > Yes, this is correct. The values that come out of RecordReaders
> > > > and go into RecordWriters do not need to implement Writable.
> > > >
> > > > -Todd
> > > >
> > >
> >
> >
> >
> > --
> > Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> > http://www.amazon.com/dp/1430219424?tag=jewlerymall
> > www.prohadoopbook.com a community for Hadoop Professionals
> >
>
