Hi Nan

I have had the same question for a while. In some research papers, people like
to make the reduce stage use a slow start. In this way, the map stage and the
reduce stage are easy to differentiate: you can use the number of remaining
unallocated map tasks to detect which stage your job is in.
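
If you only need this from the client side, a rough approximation is to poll
the job's map progress with the old "mapred" API: while mapProgress() is below
1.0 there are still unfinished map tasks. A minimal sketch, assuming Hadoop
0.20/1.x (the job ID string below is made up for illustration):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class StageProbe {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());

        // Hypothetical job ID, only for illustration.
        RunningJob job = client.getJob(JobID.forName("job_201109190001_0001"));

        // mapProgress() is the fraction of finished map tasks; while it is
        // below 1.0 there are still pending maps, so the job is in the map
        // (possibly overlapped map/shuffle) stage.
        if (job.mapProgress() < 1.0f) {
          System.out.println("map stage: " + job.mapProgress() * 100 + "% of maps done");
        } else if (!job.isComplete()) {
          System.out.println("reduce stage: " + job.reduceProgress() * 100 + "% of reduces done");
        } else {
          System.out.println("job is complete");
        }
      }
    }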

Letting the reduce stage overlap with the map stage blurs the boundary
between the two stages. I think it may decrease the execution time of the
whole job, since the shuffle can copy map output while the remaining maps are
still running (I am not sure whether this is the main reason people allow
"fast start" to happen or not).

However, "fast start" has its side-effect. It is hard to get a global view
of the map stage's output, and then the reduce stage's balance and data
locality are not easy to be solved.

Chen
Research Assistant of Holland Computing Center
PhD student of CSE Department
University of Nebraska-Lincoln


On Sun, Sep 18, 2011 at 9:24 PM, Nan Zhu <[email protected]> wrote:

> Hi, all
>
>  Recently, I was hit by a question: "how is a hadoop job divided into 2
> phases?"
>
> In textbooks, we are told that mapreduce jobs are divided into 2 phases,
> map and reduce, and that the reduce phase is further divided into 3
> stages: shuffle, sort, and reduce. But in the hadoop code I never thought
> about this question, and I didn't see any member variables in the
> JobInProgress class that indicate this information.
>
> And according to my understanding of the hadoop source code, the reduce
> tasks do not necessarily have to wait until all mappers are finished; on
> the contrary, we can see reduce tasks in the shuffle stage while some
> mappers are still running.
> So how can I tell which phase the job is currently in?
>
> Thanks
> --
> Nan Zhu
> School of Electronic, Information and Electrical Engineering, 229
> Shanghai Jiao Tong University
> 800, Dongchuan Road, Shanghai, China
> E-Mail: [email protected]
>
