Or we can just seperate shuffle from reduce stage and integrate it to the
map stage
. Then we can clearly differentiate the map stage(before shuffle finish) and
(after shuffle finish)the reduce stage.


On Mon, Sep 19, 2011 at 1:20 AM, He Chen <[email protected]> wrote:

> Hi Kai
>
> Thank you  for the reply.
>
>  The reduce() will not start because the shuffle phase does not finish. And
> the shuffle phase will not finish untill alll mapper end.
>
> I am curious about the design purpose about overlapping the map and reduce
> stage. Was this only for saving shuffling time? Or there are some other
> reasons.
>
> Best wishes!
>
> Chen
>   On Mon, Sep 19, 2011 at 12:36 AM, Kai Voigt <[email protected]> wrote:
>
>> Hi Chen,
>>
>> the times when nodes running instances of the map and reduce nodes
>> overlap. But map() and reduce() execution will not.
>>
>> reduce nodes will start copying data from map nodes, that's the shuffle
>> phase. And the map nodes are still running during that copy phase. My
>> observation had been that if the map phase progresses from 0 to 100%, it
>> matches with the reduce phase progress from 0-33%. For example, if you map
>> progress shows 60%, reduce might show 20%.
>>
>> But the reduce() will not start until all the map() code has processed the
>> entire input. So you will never see the reduce progress higher than 66% when
>> map progress didn't reach 100%.
>>
>> If you see map phase reaching 100%, but reduce phase not making any higher
>> number than 66%, it means your reduce() code is broken or slow because it
>> doesn't produce any output. An infinitive loop is a common mistake.
>>
>> Kai
>>
>> Am 19.09.2011 um 07:29 schrieb He Chen:
>>
>> > Hi Arun
>> >
>> > I have a question. Do you know what is the reason that hadoop allows the
>> map
>> > and the reduce stage overlap? Or anyone knows about it. Thank you in
>> > advance.
>> >
>> > Chen
>> >
>> > On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy <[email protected]>
>> wrote:
>> >
>> >> Nan,
>> >>
>> >> The 'phase' is implicitly understood by the 'progress' (value) made by
>> the
>> >> map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase).
>> >>
>> >> For e.g.
>> >> Reduce:
>> >> 0-33% -> Shuffle
>> >> 34-66% -> Sort (actually, just 'merge', there is no sort in the reduce
>> >> since all map-outputs are sorted)
>> >> 67-100% -> Reduce
>> >>
>> >> With 0.23 onwards the Map has phases too:
>> >> 0-90% -> Map
>> >> 91-100% -> Final Sort/merge
>> >>
>> >> Now,about starting reduces early - this is done to ensure shuffle can
>> >> proceed for completed maps while rest of the maps run, there-by
>> pipelining
>> >> shuffle and map completion. There is a 'reduce slowstart' feature to
>> control
>> >> this - by default, reduces aren't started until 5% of maps are
>> complete.
>> >> Users can set this higher.
>> >>
>> >> Arun
>> >>
>> >> On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote:
>> >>
>> >>> Hi, all
>> >>>
>> >>> recently, I was hit by a question, "how is a hadoop job divided into 2
>> >>> phases?",
>> >>>
>> >>> In textbooks, we are told that the mapreduce jobs are divided into 2
>> >> phases,
>> >>> map and reduce, and for reduce, we further divided it into 3 stages,
>> >>> shuffle, sort, and reduce, but in hadoop codes, I never think about
>> >>> this question, I didn't see any variable members in JobInProgress
>> class
>> >>> to indicate this information,
>> >>>
>> >>> and according to my understanding on the source code of hadoop, the
>> >> reduce
>> >>> tasks are unnecessarily started until all mappers are finished, in
>> >>> constract, we can see the reduce tasks are in shuffle stage while
>> there
>> >> are
>> >>> mappers which are still in running,
>> >>> So how can I indicate the phase which the job is belonging to?
>> >>>
>> >>> Thanks
>> >>> --
>> >>> Nan Zhu
>> >>> School of Electronic, Information and Electrical Engineering,229
>> >>> Shanghai Jiao Tong University
>> >>> 800,Dongchuan Road,Shanghai,China
>> >>> E-Mail: [email protected]
>> >>
>> >>
>>
>>  --
>> Kai Voigt
>> [email protected]
>>
>>
>>
>>
>>
>

Reply via email to