Hi there,

While going through the source code of Nutch, I looked at the ParseSegment class, which is the class that "parses content in a segment". Here is its MapReduce job configuration (lines 199 - 213):
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
199  JobConf job = new NutchJob(getConf());
200  job.setJobName("parse " + segment);
201
202  FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
203  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
204  job.setInputFormat(SequenceFileInputFormat.class);
205  job.setMapperClass(ParseSegment.class);
206  job.setReducerClass(ParseSegment.class);
207
208  FileOutputFormat.setOutputPath(job, segment);
209  job.setOutputFormat(ParseOutputFormat.class);
210  job.setOutputKeyClass(Text.class);
211  job.setOutputValueClass(ParseImpl.class);
212
213  JobClient.runJob(job);

Here, in lines 202 and 208, the MapReduce input/output paths are configured by calling addInputPath/setOutputPath from FileInputFormat/FileOutputFormat, and the path looks like an absolute path in the Linux OS rather than a virtual HDFS path.

On the other hand, when I look at the WordCount example on the Hadoop homepage (lines 39 - 55):
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html

39.  JobConf conf = new JobConf(WordCount.class);
40.  conf.setJobName("wordcount");
41.
42.  conf.setOutputKeyClass(Text.class);
43.  conf.setOutputValueClass(IntWritable.class);
44.
45.  conf.setMapperClass(Map.class);
46.  conf.setCombinerClass(Reduce.class);
47.  conf.setReducerClass(Reduce.class);
48.
49.  conf.setInputFormat(TextInputFormat.class);
50.  conf.setOutputFormat(TextOutputFormat.class);
51.
52.  FileInputFormat.setInputPaths(conf, new Path(args[0]));
53.  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55.  JobClient.runJob(conf);

Here, the input/output paths are configured in the same way as in Nutch, but the actual paths come in as command-line arguments:

bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

And we can see the paths passed to the program are actually HDFS paths, not Linux OS paths. I am confused: is there some other configuration that I missed which leads to the difference in the run environment? In which case should I pass an absolute Linux path, and in which case an HDFS path?
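To illustrate what I mean: my understanding is that a Path without a scheme gets combined with the cluster's default filesystem URI, so the same string can end up pointing at HDFS or at the local disk depending on configuration. Here is a minimal sketch of that resolution using plain java.net.URI (the namenode address "hdfs://namenode:9000/" is made up for illustration, not taken from either code sample):

```java
import java.net.URI;

public class PathResolutionSketch {
    public static void main(String[] args) {
        // A scheme-less path string, like the ones passed to WordCount on the command line
        URI bare = URI.create("/usr/joe/wordcount/input");

        // The default filesystem URI (fs.default.name in Hadoop 1.x, fs.defaultFS later);
        // the host and port here are hypothetical
        URI defaultFs = URI.create("hdfs://namenode:9000/");

        // Resolving the bare path against the default filesystem fills in scheme and authority
        URI resolved = defaultFs.resolve(bare);
        System.out.println(resolved); // hdfs://namenode:9000/usr/joe/wordcount/input

        // With a local default filesystem instead, the identical string would resolve locally
        URI localFs = URI.create("file:///");
        System.out.println(localFs.resolve(bare)); // file:///usr/joe/wordcount/input
    }
}
```

So, if I read it right, whether "/usr/joe/wordcount/input" means HDFS or local disk is decided by the configured default filesystem, not by the path string itself; but please correct me if that is not how Nutch's segment paths behave.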
Thanks a lot!

/usr/bin