> 1. Do I need to setup files specially for them to work with sort? My self-made test
> files always causes the map tasks to fail. They're just text files with lines such as
> "123456 abcdef", "789012 ghijkl", etc.
What sort of failures do you get? I am assuming that you want to sort on the value of the first numerical string in every line. I don't know which InputFormat you are using, but for your example the closest built-in one is TextInputFormat. With that, the key you get is the offset of the line in the file (as a LongWritable object) and the value is the line itself (as a Text object). You would then need to extract the two parts of the line, create an IntWritable or a LongWritable object from the first part and a Text object from the second part, and then call output.collect with the first part as the key and the second part as the value. Have a look at org.apache.hadoop.examples.WordCount.MapClass.map to get a better feel for what I am saying.

> 2. How do I check to make sure the sort output is truly sorted, when using the
> randomwriter + sort test? Is there any specific way to view the output files?

You could use the SortValidator. So, if you ran sort on the input directory, inputDir, and created the output directory, outputDir, run the SortValidator as

  bin/hadoop jar build/hadoop-<version#>-dev-test.jar testmapredsort -sortInput <input-path> -sortOutput <sort-output>

> 3. Are the outputs of the test programs typically part-00000, part-00001, ...part-XXXXX?
> Is there any suggested method for merging them?

Yes. You could run another MapReduce job with exactly one reduce task to merge them.

-----Original Message-----
From: Kevin Lim [mailto:[EMAIL PROTECTED]
Sent: Friday, June 29, 2007 2:56 AM
To: [email protected]
Subject: Sort inputs, outputs

Hi,

I have setup hadoop on 2 machines and am now trying to see if it is working properly. I have 3 questions:

1. Do I need to setup files specially for them to work with sort? My self-made test files always causes the map tasks to fail. They're just text files with lines such as "123456 abcdef", "789012 ghijkl", etc.

2. How do I check to make sure the sort output is truly sorted, when using the randomwriter + sort test? Is there any specific way to view the output files?

3. Are the outputs of the test programs typically part-00000, part-00001, ...part-XXXXX? Is there any suggested method for merging them?

Thanks,
Kevin Lim
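
For question 1, the parsing step described in the reply (splitting each input line into a numeric key and a text value) might look roughly like the sketch below. It is plain Java with no Hadoop classes so that it runs on its own. SortLineParser and parse are hypothetical names used for illustration, and the assumption that a single space separates the two parts comes only from the sample lines in the question.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper for illustration; not part of Hadoop.
final class SortLineParser {

    // Splits a line such as "123456 abcdef" into a numeric key and a text value.
    // Assumes the first space separates the two parts (an assumption based on
    // the sample lines in the question).
    static Map.Entry<Long, String> parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException(
                "expected \"<number> <text>\", got: " + line);
        }
        // Long.parseLong throws NumberFormatException if the first part is not numeric.
        long key = Long.parseLong(trimmed.substring(0, space));
        String value = trimmed.substring(space + 1).trim();
        return new SimpleImmutableEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map.Entry<Long, String> kv = parse("123456 abcdef");
        System.out.println(kv.getKey() + " -> " + kv.getValue());

        // Inside a map() method one would then wrap the two parts and emit them,
        // along the lines of:
        //   output.collect(new LongWritable(kv.getKey()), new Text(kv.getValue()));
    }
}
```

A malformed line makes parse throw, which inside a map task would fail the task. Whether to skip such lines or let the task fail is a choice for the job author.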
