The method you describe is the standard approach.
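For concreteness, here is a minimal sketch of that standard wiring in the old mapred API. The class names and the tab-separated composite-key layout ("a" then a tab then the secondary value) are illustrative assumptions, not from this thread: the partitioner and the value-grouping comparator look only at the natural key a, while the ordinary key sort orders the full composite key, so each reduce() call sees its values in sorted order.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SecondarySortSetup {

      // Send every composite key "a\t..." with the same a to the same reducer.
      public static class NaturalKeyPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) {}
        public int getPartition(Text key, Text value, int numPartitions) {
          String a = key.toString().split("\t", 2)[0];
          return (a.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group on a alone, so one reduce() call sees all of a's values even
      // though the shuffle sorted them by the full "a\tb" composite key.
      public static class NaturalKeyGroupingComparator extends WritableComparator {
        public NaturalKeyGroupingComparator() {
          super(Text.class, true);
        }
        @Override
        public int compare(WritableComparable x, WritableComparable y) {
          String xa = x.toString().split("\t", 2)[0];
          String ya = y.toString().split("\t", 2)[0];
          return xa.compareTo(ya);
        }
      }

      public static void wire(JobConf conf) {
        conf.setPartitionerClass(NaturalKeyPartitioner.class);
        // Sort by the full composite key (Text's default byte-order comparator
        // already puts "a\t*" ahead of "a\tb"); group by a alone.
        conf.setOutputKeyComparatorClass(Text.Comparator.class);
        conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
      }
    }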
The benefit is that the data that arrives at the reducer might be larger than you want to store in memory (for sorting by the reducer). Also, reading the entire set of reduce values would increase the amount of data allocated and would mean that you would need to make at least two passes over each reduce set. Sorting in the shuffle phase is essentially free.

One conventional use of this sorting is to ensure that summary data is processed before other data. For instance, if you are estimating conditional probabilities p(b | a) and you have counts k(a, b) and k(a, *), then it is nice to reduce on a so that you get k(a, *), k(a, b_1), k(a, b_2), ... as the input to the reducer. With a simple sort, you can guarantee that the k(a, *) value comes first, which makes it easy to compute k(a, b) / k(a, *) since you already have the value of k(a, *) in hand. (A reducer along these lines is sketched after the quoted message below.)

Another, much more obscure, use is in co-grouping, where sorting the values into a known order can help minimize memory use for temporary buffering as you split the reduce values into two or more lists, or if you implement iterators over virtual lists.

On 2/6/08 11:24 AM, "Qiong Zhang" <[EMAIL PROTECTED]> wrote:

>
> Hi, All,
>
> Is there a better way to sort by value in the same key before reaching
> reducers?
>
> I know it can be achieved by using
> setOutputValueGroupingComparator/setOutputKeyComparatorClass.
>
> But it actually adds duplicate data (i.e., the value column which needs
> sorting) to the key.
>
> Also, I wonder what the benefit is of sorting values before reaching
> reducers.
> It can be achieved in the reduce phase anyway.
>
> Thanks,
> James
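Here is the reducer sketch promised above, again hypothetical and using the old mapred API. To keep it to a single pass, it assumes the mapper repeats the secondary key in the value as "b" plus a tab plus the count (emitting "*" plus the total for k(a, *)); since "*" sorts before any letter or digit in byte order, the total is guaranteed to arrive first.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // One reduce() call per a (thanks to the grouping comparator); the sort
    // guarantees the k(a, *) record is the first value, so every k(a, b)
    // can be divided by it immediately: one pass, no buffering.
    public class ConditionalProbReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, DoubleWritable> {

      public void reduce(Text key, Iterator<Text> values,
                         OutputCollector<Text, DoubleWritable> output,
                         Reporter reporter) throws IOException {
        String a = key.toString().split("\t", 2)[0];
        long total = -1;
        while (values.hasNext()) {
          // Each value repeats the secondary key: "b\tcount".
          String[] parts = values.next().toString().split("\t", 2);
          long count = Long.parseLong(parts[1]);
          if (parts[0].equals("*")) {
            total = count;                 // k(a, *) comes first by construction
          } else if (total > 0) {
            // p(b | a) = k(a, b) / k(a, *)
            output.collect(new Text(a + "\t" + parts[0]),
                           new DoubleWritable((double) count / total));
          }
        }
      }
    }

Note that repeating b in the value is exactly the duplication the question complains about; the alternative is to parse b back out of the composite key, at the cost of relying on the reduce key changing as you advance the values iterator.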