Thanks Ted.

I would like to try doing this in Hadoop.
Do you mean I should write another MapReduce program that takes the output of
the first MapReduce job (the already existing file in this format)

IP Address   Count
1.2.5.42     27
2.8.6.6      24
7.9.24.13    8
7.9.6.9      201

and use the count as the key and the IP address as the value?
Is it possible to do this in the same program instead of writing another one?
If it is not possible, is there something available in Hadoop so that, once
the first program is done, I can call a second program to do the sorting?

If I set the number of reducers to 1, won't it take more time to reduce all
the map outputs, and hence hurt performance?

Thanks,
Senthil




-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 08, 2008 11:53 AM
To: core-user@hadoop.apache.org; '[EMAIL PROTECTED]'
Subject: Re: Reduce Sort


There are two ways to do this.  Both of them assume that you have counted
the addresses using map-reduce and the results are in HDFS.

First, since the number of unique IP addresses is likely to be relatively
small, simply sorting the results using a conventional sort is probably as
good as it gets.  This will take just a few lines of scripting code:

   base=http://your-nameserver-here/data
   wget $base/your-results-directory/part-00000 --output-document=a
   wget $base/your-results-directory/part-00001 --output-document=b
   # sort on the count (tab-separated field 2), numeric, descending
   sort -k2,2nr a b > where-you-want-the-output

It would be convenient if there were a URL that would allow you to retrieve
the concatenation of a wild-carded list of files, but the method I show
above isn't bad.

You are likely to be unhappy at the perceived impurity of this approach, but
I would ask you to think about why one might use Hadoop at all.  The best
reason is to get high performance on large problems.  The sorting part of this
problem is not all that big a deal, and a conventional sort is probably the
most effective approach here.

You can also do the sorting using Hadoop.  Just use a mapper that moves the
count into the key and keeps the IP address as the value.  I think that if you
use an IntWritable or LongWritable as the key, the default sort will give you
ascending order.  You can also define your own sort order to get descending
order.  Make sure you set the number of reducers to 1 so that you only get a
single output file.
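The second job described above can be illustrated outside Hadoop.  The sketch
below is plain Java so it runs standalone (the class and method names are
illustrative, not Hadoop's): the swap corresponds to the mapper, the reversed
comparator plays the role a custom descending key comparator would (set via
JobConf.setOutputKeyComparatorClass in a real job), and the final loop stands
in for the single reducer writing one sorted file.

```java
import java.util.ArrayList;
import java.util.List;

public class SortByCount {

    // Turn "IP<TAB>count" lines into (count, ip) pairs, sort by count
    // descending, and format the result back as "IP<TAB>count" lines.
    public static List<String> sortByCountDescending(String[] lines) {
        // "Map" step: swap the count into the key position.
        List<Object[]> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            pairs.add(new Object[] { Long.parseLong(f[1]), f[0] });
        }

        // The sort the framework applies between map and reduce; reversing
        // the comparison gives descending counts instead of ascending.
        pairs.sort((a, b) -> Long.compare((Long) b[0], (Long) a[0]));

        // "Reduce" step with a single reducer: emit one ordered output.
        List<String> out = new ArrayList<>();
        for (Object[] p : pairs) {
            out.add(p[1] + "\t" + p[0]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample output of the counting job, one "IP<TAB>count" pair per line.
        String[] counted = {
            "1.2.5.42\t27", "2.8.6.6\t24", "7.9.24.13\t8", "7.9.6.9\t201",
        };
        for (String line : sortByCountDescending(counted)) {
            System.out.println(line);
        }
    }
}
```

In a real Hadoop job the swap happens in a Mapper and the ordering is decided
by the key's comparator, so the reduce phase receives counts already sorted.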

If you have fewer than 10 million values, the conventional sort is likely to
be faster, simply because of Hadoop's startup time.


On 4/8/08 8:37 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:

> Hi,
> I am new to MapReduce.
>
> After slightly modifying the wordcount example to count IP addresses,
> I have two files, part-00000 and part-00001, with contents something like:
>
> IP Address   Count
> 1.2.5.42     27
> 2.8.6.6      24
> 7.9.24.13    8
> 7.9.6.9      201
>
> I want to sort it by the IP address count in descending order, i.e. I would
> expect to see:
>
> 7.9.6.9      201
> 1.2.5.42     27
> 2.8.6.6      24
> 7.9.24.13    8
>
> Could you please suggest how to do this?
> Also, to merge both partitions (part-00000 and part-00001) into one output
> file, is there a function already available in the MapReduce framework,
> or do we need to use Java I/O for this?
> Thanks,
> Senthil
