RE: Pre-sort value list in reduce

Goel, Ankur Tue, 15 Apr 2008 23:24:11 -0700

To simulate a secondary sort on values, use the API
JobConf.setOutputValueGroupingComparator()
wherein you are required to provide a comparator class
that handles comparisons on the values
of the key.


The comparator will be passed all the values for a given key in the
*Shuffle* phase after all the 
values for the key have been collected.

So for key/value pairs 
<hello, 3>
<hello, 4>
<hello, 2>
<hello, 1>

Your comparator would be required to compare any two values
of key 'hello',
which are numeric in this example.
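
To make the example concrete, here is a minimal plain-Java sketch (no
Hadoop classes) of one common way to get that effect: sort the map output
on the (key, value) pair, then group on the key alone. As far as the
javadoc describes it, the comparator registered through
setOutputValueGroupingComparator() is handed keys rather than values, so
in a real job the value usually has to be folded into a composite key
first. Treat that as an assumption to check against your Hadoop version.
The names SecondarySortSketch, Pair and sortAndGroup are made up for this
illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SecondarySortSketch {
    // Hypothetical stand-in for one map output record.
    static final class Pair {
        final String key;
        final int value;

        Pair(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Sort on (key, value), then group on the key alone, so each
    // key's value list comes out already ordered.
    static Map<String, List<Integer>> sortAndGroup(List<Pair> mapOutput) {
        List<Pair> sorted = new ArrayList<>(mapOutput);
        sorted.sort(Comparator.comparing((Pair p) -> p.key)
                              .thenComparingInt(p -> p.value));

        // LinkedHashMap keeps the keys in the order they were sorted.
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Pair p : sorted) {
            grouped.computeIfAbsent(p.key, k -> new ArrayList<>()).add(p.value);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The four pairs from the example above.
        List<Pair> mapOutput = new ArrayList<>();
        mapOutput.add(new Pair("hello", 3));
        mapOutput.add(new Pair("hello", 4));
        mapOutput.add(new Pair("hello", 2));
        mapOutput.add(new Pair("hello", 1));
        System.out.println(sortAndGroup(mapOutput));
    }
}
```

Running main prints {hello=[1, 2, 3, 4]}.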

So first, the comparator should be aware of the value types and do the
necessary casting; second, the values
themselves should be comparable.
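
A small sketch of those two points, again in plain Java rather than
against the Hadoop API: it assumes the values arrive as text and must be
parsed to integers before they can be compared. ValueOrderSketch,
NUMERIC_VALUE_ORDER and sortValues are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ValueOrderSketch {
    // Hypothetical comparator: the values are text, so parse them to
    // integers before comparing (the "necessary casting" step).
    static final Comparator<String> NUMERIC_VALUE_ORDER =
        Comparator.comparingInt((String v) -> Integer.parseInt(v.trim()));

    // Returns a sorted copy of the values seen for one key.
    static List<String> sortValues(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(NUMERIC_VALUE_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        // Values for key 'hello', in the order given in the example.
        List<String> values = Arrays.asList("3", "4", "2", "1");
        System.out.println("hello -> " + sortValues(values));
    }
}
```

Running main prints hello -> [1, 2, 3, 4]. Without the parsing step the
values would be compared as text, which would put "10" ahead of "9".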

Hope this helps.

-Ankur


-----Original Message-----
From: pi song [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 15, 2008 6:30 PM
To: [email protected]
Subject: Re: Pre-sort value list in reduce

Arkady,

Isn't the partitioner for just redirecting map output to the right
reduce bucket? What I want is each value list in reduce being sorted.

Pi

On Tue, Apr 15, 2008 at 7:40 PM, phonechen <[EMAIL PROTECTED]> wrote:

> Hi Arkady,
> I'm also confused about how the Hadoop framework does this job:
> transforming the many <key,value> pairs output by the map() phase
> into <key, list of values> before the reduce() phase.
> For example, the map() output is:
> <hello,1>
> <hello,1>
> <world,1>
> <hello,1>
> <world,1>
> but the reduce() input is:
> <hello,[1,1,1]>
> <world,[1,1]>
> Can you point out which class takes care of this?
> Thanks very much!
>
> Best Regards,
>
> Yours
> Phonechen
>
> On 4/15/08, arkady borkovsky <[EMAIL PROTECTED]> wrote:
> >
> > look at
> >  -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
> >
> > --ab
> >
> > On Apr 14, 2008, at 4:25 PM, pi song wrote:
> >
> > Dear people in Hadoop mailing list,
> > >
> > > > Is there any way to control the value list in reduce (Key, List of
> > > > values) to be sorted? Or at least clusteringly sorted (containing
> > > > clusters of sorted values, e.g. 1,1,1,2,2,2,2,3,3,3,
> > > > 1,1,1,1,1,1,2,2,2,2,3, 1,1,2,2,2,3,3,3,3,3,3,3)?
> > > > I had a look at JobConf.setOutputValueGroupingComparator in the
> > > > javadoc and I think it might be the answer, because I feel most of
> > > > the time grouping in Hadoop is done by sort. Am I right?
> > > >
> > > > Can anyone help me? How about the performance impact of your
> > > > solution?
> > >
> > > Thanks in advance,
> > > Pi
> > >
> >
> >
>
>
> --
> --~--~---------~--~----~------------~-------~--
>
> Best Regards,
>
> Yours
> Phonechen
>
> -~----------~----~----~----~------~----~------
>
