Hi Jianmin,

This is not (currently) supported by Hadoop (nor by Google's MapReduce, as far as I know). What you're looking for sounds more like Microsoft's Dryad.

One thing that is supported in Hadoop 0.19 and later is JVM reuse. If you enable this feature, task trackers will reuse a JVM across tasks of the same job, so you can carry some state between tasks in static variables.
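To turn it on, set mapred.job.reuse.jvm.num.tasks, e.g. through the JobConf
setter (-1 means no limit on the number of tasks run per JVM):

JobConf conf = new JobConf(MyJob.class);  // MyJob is a placeholder name
conf.setNumTasksToExecutePerJvm(-1);
// equivalent to: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

The caching pattern then looks roughly like the sketch below. This is
untested, and CachingMapper and its members are placeholder names of mine,
not anything in Hadoop:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Caches expensive-to-build state in a static field. The cache survives
// only when the framework happens to reuse the JVM for another task of
// the same job, so the code must always be prepared to rebuild it.
public class CachingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static Object expensiveState;  // stand-in for your own type

  public void configure(JobConf conf) {
    // configure() runs once per task; with JVM reuse, a later task in
    // the same job may find the state already built.
    if (expensiveState == null) {
      expensiveState = buildExpensiveState(conf);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    // ... use expensiveState purely as a read-only cache here ...
  }

  private static Object buildExpensiveState(JobConf conf) {
    return new Object();  // stand-in for loading a dictionary, model, etc.
  }
}

The null check in configure() is the whole trick: a fresh JVM builds the
state, a reused one skips straight past it.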
I'd caution you, though, against relying on this as anything but an optimization. The reason Hadoop is limited to MR (or M+RM*, as you put it) is that simplicity and reliability often go hand in hand. If you start keeping important state in RAM in the tasktracker JVMs and one of them goes down, you may have to restart your entire job sequence from the top. In plain MapReduce you may need to rerun a mapper or a reducer, but the state is all on disk, ready to go.

-Todd

On Sun, May 31, 2009 at 11:30 PM, Jianmin Woo <jianmin_...@yahoo.com> wrote:
> Thanks for your quick response, Jothi.
>
> Yes, I actually need each mapper process to handle several rounds of
> map/reduce work without exiting. I checked ChainMapper/ChainReducer
> before; they only support an M(+)RM* chain of mappers and reducers.
> Also, it seems that the number of mappers has to be specified when the
> job is configured. I am wondering whether the mapper itself could
> determine how many map/reduce rounds to run. Do you think this is
> feasible, or would it take some hack on the Hadoop framework to
> support it?
>
> Thanks,
> Jianmin
>
> ________________________________
> From: Jothi Padmanabhan <joth...@yahoo-inc.com>
> To: core-user@hadoop.apache.org
> Sent: Monday, June 1, 2009 2:03:13 PM
> Subject: Re: question about when shuffle/sort start working
>
> No, you cannot raise this event yourself; it is generated internally
> by the framework.
>
> I am guessing that what you probably want is a chain of MapReduce jobs
> where the output of one is automatically fed as input to the next. You
> can look at these classes: JobControl and ChainMapper/ChainReducer.
>
> Jothi
>
> On 6/1/09 11:00 AM, "Jianmin Woo" <jianmin_...@yahoo.com> wrote:
>
>> Thanks a lot for your explanation, Jothi.
>>
>> So is this event generated by the Hadoop framework? Is there any API
>> in the mapper to fire it? Actually, I am thinking of implementing a
>> mapper that emits some <key, value> pairs, fires this event to let
>> the reducer work, and then has the same mapper task emit some other
>> <key, value> pairs, and so on. Do you think this logic is feasible
>> with the current API?
>>
>> Thanks,
>> Jianmin
>>
>> ________________________________
>> From: Jothi Padmanabhan <joth...@yahoo-inc.com>
>> To: core-user@hadoop.apache.org
>> Sent: Monday, June 1, 2009 12:26:31 PM
>> Subject: Re: question about when shuffle/sort start working
>>
>> When a mapper completes, MapCompletionEvents are generated. Reducers
>> try to fetch the map outputs for a given map only on receipt of such
>> events.
>>
>> Jothi
>>
>> On 5/30/09 10:00 AM, "Jianmin Woo" <jianmin_...@yahoo.com> wrote:
>>
>>> Hi,
>>> I am confused by the protocol between the mapper and the reducer.
>>> When the mapper is done emitting (key, value) pairs, does it send
>>> any signal to the Hadoop framework to indicate that the map is done
>>> and the shuffle/sort can begin for the reducer? If there is no such
>>> signal in the protocol, when does the framework begin the
>>> shuffle/sort?
>>>
>>> Thanks,
>>> Jianmin
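P.S. On the variable-number-of-rounds question in the quoted thread
above: one common pattern is to drive the loop from the submitting
client rather than from inside a mapper, running one complete MR job per
round and reading a counter to decide whether another round is needed.
Below is a rough, untested sketch against the 0.19-era API; the identity
classes are stand-ins for your real per-round logic, and the
"myapp"/"WORK_REMAINING" counter is a name I made up, which your
reducers would have to increment while work remains:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// One complete map/reduce pass per iteration; the client, not the
// mapper, decides how many rounds to run.
public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);  // args[1] is the output base name

    for (int round = 0; ; round++) {
      JobConf conf = new JobConf(IterativeDriver.class);
      conf.setJobName("round-" + round);
      conf.setMapperClass(IdentityMapper.class);    // your real mapper
      conf.setReducerClass(IdentityReducer.class);  // your real reducer
      FileInputFormat.setInputPaths(conf, input);
      Path output = new Path(args[1] + "-round-" + round);
      FileOutputFormat.setOutputPath(conf, output);

      RunningJob job = JobClient.runJob(conf);  // blocks until completion

      // Stop once a round reports no remaining work.
      long remaining = job.getCounters()
          .findCounter("myapp", "WORK_REMAINING").getCounter();
      if (remaining == 0) {
        break;
      }
      input = output;  // this round's output feeds the next round
    }
  }
}

Each round pays the full job startup and disk I/O cost, but that is
exactly the on-disk checkpointing that lets you rerun a single failed
round instead of the whole sequence.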