The method you describe is the standard approach.
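For concreteness, here is a minimal sketch of that standard wiring in the old mapred API. The class names and the tab-separated composite-key layout ("a" then a tab then the secondary value) are illustrative assumptions, not from this thread: the partitioner and the value-grouping comparator look only at the natural key a, while the ordinary key sort orders the full composite key, so each reduce() call sees its values in sorted order.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SecondarySortSetup {

      // Send every composite key "a\t..." with the same a to the same reducer.
      public static class NaturalKeyPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) {}
        public int getPartition(Text key, Text value, int numPartitions) {
          String a = key.toString().split("\t", 2)[0];
          return (a.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group on a alone, so one reduce() call sees all of a's values even
      // though the shuffle sorted them by the full "a\tb" composite key.
      public static class NaturalKeyGroupingComparator extends WritableComparator {
        public NaturalKeyGroupingComparator() {
          super(Text.class, true);
        }
        @Override
        public int compare(WritableComparable x, WritableComparable y) {
          String xa = x.toString().split("\t", 2)[0];
          String ya = y.toString().split("\t", 2)[0];
          return xa.compareTo(ya);
        }
      }

      public static void wire(JobConf conf) {
        conf.setPartitionerClass(NaturalKeyPartitioner.class);
        // Sort by the full composite key (Text's default byte-order comparator
        // already puts "a\t*" ahead of "a\tb"); group by a alone.
        conf.setOutputKeyComparatorClass(Text.Comparator.class);
        conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
      }
    }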
The benefit is that the data that arrives at the reducer might be larger than you want to store in memory (for sorting by the reducer). Also, reading the entire set of reduce values would increase the amount of data allocated and would mean that you would need to make at least two passes over each reduce set. Sorting in the shuffle phase is essentially free.

One conventional use of this sorting is to ensure that summary data is processed before other data. For instance, if you are estimating conditional probabilities p(b | a) and you have counts k(a, b) and k(a, *), then it is nice to reduce on a so that you get k(a, *), k(a, b_1), k(a, b_2), ... as the input to the reducer. With a simple sort, you can guarantee that the k(a, *) value comes first, which makes it easy to compute k(a, b) / k(a, *) since you already have the value of k(a, *) in hand. (A reducer along these lines is sketched after the quoted message below.)

Another, much more obscure, use is in co-grouping, where sorting the values into a known order can help minimize memory use for temporary buffering as you split the reduce values into two or more lists, or if you implement iterators over virtual lists.

On 2/6/08 11:24 AM, "Qiong Zhang" <[EMAIL PROTECTED]> wrote:

>
> Hi, All,
>
> Is there a better way to sort by value in the same key before reaching
> reducers?
>
> I know it can be achieved by using
> setOutputValueGroupingComparator/setOutputKeyComparatorClass.
>
> But it actually adds duplicate data (i.e., the value column which needs
> sorting) to the key.
>
> Also, I wonder what the benefit is of sorting values before reaching
> reducers.
> It can be achieved in the reduce phase anyway.
>
> Thanks,
> James
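Here is the reducer sketch promised above, again hypothetical and using the old mapred API. To keep it to a single pass, it assumes the mapper repeats the secondary key in the value as "b" plus a tab plus the count (emitting "*" plus the total for k(a, *)); since "*" sorts before any letter or digit in byte order, the total is guaranteed to arrive first.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // One reduce() call per a (thanks to the grouping comparator); the sort
    // guarantees the k(a, *) record is the first value, so every k(a, b)
    // can be divided by it immediately: one pass, no buffering.
    public class ConditionalProbReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, DoubleWritable> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, DoubleWritable> output,
                         Reporter reporter) throws IOException {
        String a = key.toString().split("\t", 2)[0];
        long total = -1;
        while (values.hasNext()) {
          // Each value repeats the secondary key: "b\tcount".
          String[] parts = values.next().toString().split("\t", 2);
          long count = Long.parseLong(parts[1]);
          if (parts[0].equals("*")) {
            total = count;                 // k(a, *) comes first by construction
          } else if (total > 0) {
            // p(b | a) = k(a, b) / k(a, *)
            output.collect(new Text(a + "\t" + parts[0]),
                           new DoubleWritable((double) count / total));
          }
        }
      }
    }

Note that repeating b in the value is exactly the duplication the question complains about; the alternative is to parse b back out of the composite key, at the cost of relying on the reduce key changing as you advance the values iterator.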