Thank you all for the replies. It looks like the class KeyFieldBasedPartitioner in org.apache.hadoop.mapred.lib can be used in Hadoop streaming to sort on both the key (like a primary key) and the value (like a secondary key) without duplicating data.
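For reference, a sketch of how KeyFieldBasedPartitioner is typically wired up in a streaming job. The input/output paths and the mapper/reducer script names below are hypothetical; the -D options (key-field count, separator, and partitioner options) follow the Hadoop Streaming documentation of that era, so treat this as an illustrative invocation rather than a tested command:

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -D stream.num.map.output.key.fields=2 \
      -D map.output.key.field.separator=, \
      -D mapred.text.key.partitioner.options=-k1,1 \
      -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
      -input in_dir -output out_dir \
      -mapper my_mapper.py -reducer my_reducer.py

Here the first two comma-separated fields form the map output key (so the framework sorts on both), while -k1,1 tells the partitioner to partition on the first field only, so all records sharing the primary key reach the same reducer in secondary-key order.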
It would be useful to have the same functionality in the native Java API.

James

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 06, 2008 1:53 PM
To: core-user@hadoop.apache.org
Subject: Re: sort by value

On 2/6/08 11:58 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:

>> But it actually adds duplicate data (i.e., the value column which
>> needs sorting) to the key.
>
> Why? You can always take it out of the value to remove the redundancy.

Actually, you can't in most cases. Suppose you have input data like this:

    a, b_1
    a, b_2
    a, b_1

And then the mapper produces data like this for each input record:

    a, b_1, 1
    a, *, 1
    a, b_2, 1
    a, *, 1
    a, b_1, 1
    a, *, 1

If you use the first two fields as the key so that you can sort the
records nicely, you get the following inputs to the reducer:

    <a, *>, [3, 2, 1]

You now don't know what the counts go to except for the first one. If
you replicate the second field in the value output of the map, then you
get this:

    <a, *>, [[*, 3], [b_1, 2], [b_2, 1]]

And you can produce the desired output:

    a, b_1, 2/3
    a, b_2, 1/3
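Ted's counting example can be simulated outside Hadoop. The pipeline below is a rough sketch using sort and awk: the first awk plays the mapper (emitting both the `*` total marker and the replicated second field), sort plays the shuffle, and the second awk plays the reducer. The comma separator, the `*` marker, and the fraction output are taken from the example above; everything else is illustrative.

```shell
# Input records: key "a" with secondary values b_1, b_2, b_1.
printf 'a,b_1\na,b_2\na,b_1\n' \
  | awk -F, '{ print $1",*,1"; print $1","$2",1" }' \
  | LC_ALL=C sort -t, -k1,1 -k2,2 \
  | awk -F, '
      # Reducer: records arrive grouped by (field1, field2).  In the C
      # locale "*" sorts before "b_*", so the group total is seen before
      # the per-value counts, as in the email above.
      { key = $1","$2 }
      key != prev { if (prev != "") flush(); prev = key; sum = 0 }
      { sum += $3 }
      END { flush() }
      function flush() {
        split(prev, p, ",")
        if (p[2] == "*") total = sum   # total for the primary key
        else print p[1]","p[2]","sum"/"total
      }'
# prints:
#   a,b_1,2/3
#   a,b_2,1/3
```

The LC_ALL=C on sort matters: it guarantees the byte-order comparison that puts `*` ahead of the `b_*` values, which is exactly the role the secondary-sort comparator plays inside Hadoop.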