Thanks Ted. I would like to try using Hadoop. Do you mean writing another MapReduce program that takes the output of the first MapReduce job (the already existing file in this format)

IP Add       Count
1.2.5.42     27
2.8.6.6      24
7.9.24.13    8
7.9.6.9      201

and uses the count as the key and the IP address as the value? Is it possible to do this in the same program instead of writing another one? If it is not possible, is there something available in Hadoop so that, once the first program is done, I can call a second program to do the sorting? If I set the number of reducers to 1, then it will take more time to reduce all the maps and hence affect performance, right?

Thanks,
Senthil

-----Original Message-----
From: Ted Dunning [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 08, 2008 11:53 AM
To: core-user@hadoop.apache.org; '[EMAIL PROTECTED]'
Subject: Re: Reduce Sort

There are two ways to do this. Both of them assume that you have counted the addresses using map-reduce and the results are in HDFS.

First, since the number of unique IP addresses is likely to be relatively small, simply sorting the results using a conventional sort is probably as good as it gets. This will take just a few lines of scripting code:

    base=http://your-nameserver-here/data
    wget $base/your-results-directory/part-00000 --output-document=a
    wget $base/your-results-directory/part-00001 --output-document=b
    sort -k2nr a b > where-you-want-the-output

It would be convenient if there were a URL that would allow you to retrieve the concatenation of a wild-carded list of files, but the method I show above isn't bad. You are likely to be unhappy at the perceived impurity of this approach, but I would ask you to think about why one might use Hadoop at all. The best reason is to get high performance on large problems. The sorting part of this problem is not all that big a deal, and using a conventional sort is probably the most effective approach here.

You can also do the sorting using Hadoop. Just use a mapper that moves the count to the key and keeps the IP as the value. I think that if you use an IntWritable or LongWritable as the key, then the default sorting would give you ascending order.
You can also define the sort order so that you get descending order. Make sure you set the number of reducers to 1 so that you only get a single output file. If you have fewer than 10 million values, the conventional sort is likely to be faster simply because of Hadoop's startup time.

On 4/8/08 8:37 AM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote:

> Hi,
> I am new to MapReduce.
>
> After slightly modifying the example wordcount to count IP addresses,
> I have two files, part-00000 and part-00001, with contents something like:
>
> IP Add       Count
> 1.2.5.42     27
> 2.8.6.6      24
> 7.9.24.13    8
> 7.9.6.9      201
>
> I want to sort it by IP address count in descending order, i.e. I would expect to see:
>
> 7.9.6.9      201
> 1.2.5.42     27
> 2.8.6.6      24
> 7.9.24.13    8
>
> Could you please suggest how to do this?
> And to merge both partitions (part-00000 and part-00001) into one output file, is there any function already available in the MapReduce framework, or do we need to use Java IO to do this?
> Thanks,
> Senthil
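Ted's Hadoop-based suggestion is a second job whose mapper swaps each record from (IP, count) to (count, IP), after which the framework's key sort, reversed by a custom comparator, yields descending order. The sketch below illustrates just that swap-and-sort logic in plain Java, without the Hadoop API; the class, record, and method names are illustrative and not from this thread:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simulates the second "sort" job: the map step swaps (ip, count) -> (count, ip),
// and a reversed comparator plays the role of a descending key comparator.
public class CountSort {
    // One output record from the first job: an IP address and its count.
    record Entry(long count, String ip) {}

    public static List<Entry> sortByCountDesc(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            // Mapper step: parse "ip<whitespace>count" and move the count into the key.
            String[] parts = line.trim().split("\\s+");
            entries.add(new Entry(Long.parseLong(parts[1]), parts[0]));
        }
        // Shuffle/sort step: order by count descending, as a custom comparator would.
        entries.sort(Comparator.comparingLong(Entry::count).reversed());
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> sorted = sortByCountDesc(List.of(
            "1.2.5.42\t27", "2.8.6.6\t24", "7.9.24.13\t8", "7.9.6.9\t201"));
        for (Entry e : sorted) {
            System.out.println(e.ip() + "\t" + e.count());
        }
    }
}
```

In a real Hadoop job the swap happens in the map method and the comparator is registered on the job configuration; with a single reducer, the one output file comes out already sorted this way.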