Mark,

   Instead of writing intermediate data that normally goes to the reducers,
the mapper can write its output directly to HDFS if the job is "map-only."

According to http://hadoop.apache.org/common/docs/r0.20.1/streaming.html

Mapper-Only Jobs

Often, you may want to process input data using a map function only. To do
this, simply set mapred.reduce.tasks to zero. The Map/Reduce framework will
not create any reducer tasks. Rather, the outputs of the mapper tasks will
be the final output of the job.
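
In the Java API the same knob is Job.setNumReduceTasks(0), and the mapper's
context.write() then becomes the final output on HDFS. A rough sketch, with
placeholder class names and paths (nothing here is from your actual job):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MapOnlyJob {

  // With zero reducers, whatever the mapper writes via context.write()
  // is written straight to the job's output directory on HDFS.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "map-only example");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);  // same effect as mapred.reduce.tasks=0
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}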


   ~ Minh

On Mon, Jun 18, 2012 at 10:40 AM, Mark Kerzner <mark.kerz...@shmsoft.com> wrote:

> John,
>
> that sounds very interesting, and I may implement such a workflow, but can
> I write back to HDFS in the mapper? In the reducer it is a standard
> context.write(), but it is a different context.
>
> Thank you,
> Mark
>
> On Mon, Jun 18, 2012 at 9:24 AM, John Armstrong <j...@ccri.com> wrote:
>
> > On 06/18/2012 10:19 AM, Mark Kerzner wrote:
> >
> >> If only reducers could be told to start their work on the first
> >> maps that they see, my processing would begin to show results much
> >> earlier,
> >> before all the mappers are done.
> >>
> >
> > The sort/shuffle phase isn't just about ordering the keys, it's about
> > collecting all the results of the map phase that share a key together for
> > the reducers to work on.  If your reducer can operate on mapper outputs
> > independently of each other, then it sounds like it's really another mapper
> > and should be either factored into the mapper or rewritten as a mapper on
> > its own and both mappers thrown into the ChainMapper (if you're using the
> > older API).
> >
>
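
For reference, John's ChainMapper suggestion with the older mapred API could
look roughly like the sketch below. The two mappers are made up purely to show
the wiring; with zero reduce tasks the output of the last mapper in the chain
still goes straight to HDFS:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;

public class ChainedMapOnly {

  // First stage: trim whitespace from each input line.
  public static class TrimMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<LongWritable, Text> out, Reporter reporter)
        throws IOException {
      out.collect(key, new Text(value.toString().trim()));
    }
  }

  // Second stage: upper-case what the first stage emitted.
  public static class UpperMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<LongWritable, Text> out, Reporter reporter)
        throws IOException {
      out.collect(key, new Text(value.toString().toUpperCase()));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ChainedMapOnly.class);
    job.setJobName("chained map-only example");
    job.setNumReduceTasks(0);  // still map-only

    ChainMapper.addMapper(job, TrimMapper.class,
        LongWritable.class, Text.class, LongWritable.class, Text.class,
        true, new JobConf(false));
    ChainMapper.addMapper(job, UpperMapper.class,
        LongWritable.class, Text.class, LongWritable.class, Text.class,
        true, new JobConf(false));

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    JobClient.runJob(job);
  }
}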
