Thank you all for the replies. It looks like the class KeyFieldBasedPartitioner in org.apache.hadoop.mapred.lib can be used in Hadoop streaming to sort on both the key (like a primary key) and the value (like a secondary key) without duplicating data.
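For reference, a sketch of how KeyFieldBasedPartitioner is typically wired up in a streaming job. The input/output paths and the mapper/reducer script names below are hypothetical; the -D options (key-field count, separator, and partitioner options) follow the Hadoop Streaming documentation of that era, so treat this as an illustrative invocation rather than a tested command:

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
      -D stream.num.map.output.key.fields=2 \
      -D map.output.key.field.separator=, \
      -D mapred.text.key.partitioner.options=-k1,1 \
      -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
      -input in_dir -output out_dir \
      -mapper my_mapper.py -reducer my_reducer.py

Here the first two comma-separated fields form the map output key (so the framework sorts on both), while -k1,1 tells the partitioner to partition on the first field only, so all records sharing the primary key reach the same reducer in secondary-key order.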
It would be useful to have the same functionality in the native Java API.

James

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 06, 2008 1:53 PM
To: core-user@hadoop.apache.org
Subject: Re: sort by value

On 2/6/08 11:58 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:

>> But it actually adds duplicate data (i.e., the value column which
>> needs sorting) to the key.
>
> Why? You can always take it out of the value to remove the redundancy.

Actually, you can't in most cases. Suppose you have input data like this:

    a, b_1
    a, b_2
    a, b_1

And then the mapper produces data like this for each input record:

    a, b_1, 1
    a, *, 1
    a, b_2, 1
    a, *, 1
    a, b_1, 1
    a, *, 1

If you use the first two fields as the key so that you can sort the
records nicely, you get the following inputs to the reducer:

    <a, *>, [3, 2, 1]

You now don't know what the counts go to except for the first one. If
you replicate the second field in the value output of the map, then you
get this:

    <a, *>, [[*, 3], [b_1, 2], [b_2, 1]]

And you can produce the desired output:

    a, b_1, 2/3
    a, b_2, 1/3
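Ted's counting example can be simulated outside Hadoop. The pipeline below is a rough sketch using sort and awk: the first awk plays the mapper (emitting both the `*` total marker and the replicated second field), sort plays the shuffle, and the second awk plays the reducer. The comma separator, the `*` marker, and the fraction output are taken from the example above; everything else is illustrative.

```shell
# Input records: key "a" with secondary values b_1, b_2, b_1.
printf 'a,b_1\na,b_2\na,b_1\n' \
  | awk -F, '{ print $1",*,1"; print $1","$2",1" }' \
  | LC_ALL=C sort -t, -k1,1 -k2,2 \
  | awk -F, '
      # Reducer: records arrive grouped by (field1, field2).  In the C
      # locale "*" sorts before "b_*", so the group total is seen before
      # the per-value counts, as in the email above.
      { key = $1","$2 }
      key != prev { if (prev != "") flush(); prev = key; sum = 0 }
      { sum += $3 }
      END { flush() }
      function flush() {
        split(prev, p, ",")
        if (p[2] == "*") total = sum   # total for the primary key
        else print p[1]","p[2]","sum"/"total
      }'
# prints:
#   a,b_1,2/3
#   a,b_2,1/3
```

The LC_ALL=C on sort matters: it guarantees the byte-order comparison that puts `*` ahead of the `b_*` values, which is exactly the role the secondary-sort comparator plays inside Hadoop.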