Re: Significance of file.out.index during Shuffle Phase ?

Pavan Kulkarni Mon, 20 Aug 2012 20:15:47 -0700

Arun,

  Yes, got it now. What I am trying to do is store the intermediate
data on a shared file system and create hard links to the map outputs
(file.out) spilled by the map nodes. This would eliminate the copy
phase of the shuffle stage.
  But now that I have learned that the data for different reducers is
partitioned within the same file (file.out), creating hard links wouldn't
serve the purpose, would it? Or is there a way to do it?
Please correct me if I am wrong in any assumption. Thanks.
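
To make the point about partitions sharing one file concrete, here is a
minimal sketch of how an index can map a reduce ID to a byte range inside
file.out. The record layout (three big-endian 8-byte longs per partition:
start offset, raw length, on-disk length) is my assumption about
file.out.index, and the function names are hypothetical; this is not
Hadoop's own code.

```python
import struct

# Assumed record layout: three big-endian 8-byte longs per reduce partition
# (start offset, raw length, on-disk length). The real index file may also
# carry a trailing checksum, which this sketch ignores.
RECORD = struct.Struct(">qqq")


def build_index(segments):
    """Pack one index record per partition for a list of byte-string segments."""
    out, offset = bytearray(), 0
    for seg in segments:
        # No compression in this sketch, so raw and on-disk lengths match.
        out += RECORD.pack(offset, len(seg), len(seg))
        offset += len(seg)
    return bytes(out)


def partition_range(index_bytes, reduce_id):
    """Return (start_offset, on_disk_length) for one reducer's slice."""
    start, _raw_len, part_len = RECORD.unpack_from(
        index_bytes, reduce_id * RECORD.size
    )
    return start, part_len


def read_partition(map_output, index_bytes, reduce_id):
    """Slice one reducer's bytes out of the combined map output."""
    start, part_len = partition_range(index_bytes, reduce_id)
    return map_output[start:start + part_len]


# Toy stand-in for file.out holding three partitions back to back.
segments = [b"aaaa", b"bb", b"cccccc"]
file_out = b"".join(segments)
index = build_index(segments)
print(read_partition(file_out, index, 1))
```

A hard link always names the whole file, so under this layout each reducer
would still need the index to find its own byte range within file.out.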


On Sun, Aug 19, 2012 at 10:54 PM, Arun C Murthy <[email protected]> wrote:

> You'll need to make significant changes to MapTask.java, which won't make
> it back to the mainline.
>
> Why? We had this before and quickly ran out of inodes on the local-disk.
> Think of large jobs with 10,000 maps * 1000 reduces -> that's 10M files.
>
> Arun
>
> On Aug 19, 2012, at 8:57 AM, Pavan Kulkarni wrote:
>
> > Oh, thanks a lot, Harsh. Exactly what I was looking for.
> > I wanted to create different file.out's for different reducers, something
> > like file.out.1 for reducer 1, file.out.2 for reducer 2, etc. Is it
> > possible to do this in the MapReduce program, or do I need to tweak some
> > Hadoop source files for that? Thanks.
> >
> > On Sun, Aug 19, 2012 at 7:02 AM, Harsh J <[email protected]> wrote:
> >
> >> Hey Pavan,
> >>
> >> Yes you've got it almost right on how file.out is served to each
> >> reducer. See the code at
> >>
> >> http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java?view=markup
> >> (Method under L502:L565 that sends data for a specific
> >> reduce/partition ID (integer)).
> >>
> >> On Sun, Aug 19, 2012 at 9:05 AM, Pavan Kulkarni <[email protected]> wrote:
> >>> Hi,
> >>>
> >>>  I was trying to understand how exactly each reducer finds out how to
> >>> fetch the data of its own partition from the map nodes.
> >>> During the execution of MapReduce, I see that *file.out* is created on
> >>> the map nodes, so my question is: how does a reducer know what part of
> >>> file.out to fetch? Does *file.out.index* play any role?
> >>> Any help is appreciated. Thanks.
> >>>
> >>>
> >>>
> >>> --With Regards
> >>> Pavan Kulkarni
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
> >
> >
> >
> > --
> >
> > --With Regards
> > Pavan Kulkarni
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
>
>


-- 

--With Regards
Pavan Kulkarni
