Hi there,
I was going through the source code of Nutch, specifically the ParseSegment
class, which is the class that "parses content in a segment". Here is its
MapReduce job configuration part:
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
(lines 199-213)
199  JobConf job = new NutchJob(getConf());
200  job.setJobName("parse " + segment);
201
202  FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
203  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
204  job.setInputFormat(SequenceFileInputFormat.class);
205  job.setMapperClass(ParseSegment.class);
206  job.setReducerClass(ParseSegment.class);
207
208  FileOutputFormat.setOutputPath(job, segment);
209  job.setOutputFormat(ParseOutputFormat.class);
210  job.setOutputKeyClass(Text.class);
211  job.setOutputValueClass(ParseImpl.class);
212
213  JobClient.runJob(job);
Here, at lines 202 and 208, the MapReduce input/output paths are configured
by calling FileInputFormat.addInputPath and FileOutputFormat.setOutputPath.
The segment path looks like an absolute path on the Linux filesystem rather
than an HDFS path.
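My understanding (an assumption on my part, not something I verified in the
Hadoop source) is that an unqualified Path string is resolved against the
default filesystem URI, so the same string can point at HDFS or the local
disk depending on configuration. A rough sketch of that idea using plain
java.net.URI (the hostname/port below are placeholders):

```java
import java.net.URI;

public class PathResolveSketch {
    // Hypothetical illustration: an unqualified path is qualified
    // against the default filesystem URI.
    static String qualify(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(path).toString();
    }

    public static void main(String[] args) {
        // The same path string, resolved against two different defaults:
        System.out.println(qualify("hdfs://namenode:9000/", "/usr/joe/wordcount/input"));
        // -> hdfs://namenode:9000/usr/joe/wordcount/input
        System.out.println(qualify("file:///", "/usr/joe/wordcount/input"));
        // -> file:///usr/joe/wordcount/input
    }
}
```

So, if I understand correctly, whether "/usr/joe/wordcount/input" means an
HDFS directory or a local one would depend on the configured default
filesystem, not on the path string itself.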
On the other hand, when I look at the WordCount example in the Hadoop
MapReduce tutorial:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Line 39 - 55)
39.  JobConf conf = new JobConf(WordCount.class);
40.  conf.setJobName("wordcount");
41.
42.  conf.setOutputKeyClass(Text.class);
43.  conf.setOutputValueClass(IntWritable.class);
44.
45.  conf.setMapperClass(Map.class);
46.  conf.setCombinerClass(Reduce.class);
47.  conf.setReducerClass(Reduce.class);
48.
49.  conf.setInputFormat(TextInputFormat.class);
50.  conf.setOutputFormat(TextOutputFormat.class);
51.
52.  FileInputFormat.setInputPaths(conf, new Path(args[0]));
53.  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55.  JobClient.runJob(conf);
Here, the input/output paths are configured in the same way as in Nutch,
but the actual paths are passed in as command-line arguments:
bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
/usr/joe/wordcount/input /usr/joe/wordcount/output
And we can see the paths passed to the program are HDFS paths, not local
Linux filesystem paths.
I am confused: is there some other configuration I missed that leads to
this difference in run environment? And in which case should I pass an
absolute local path versus an HDFS path?
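For what it's worth, I suspect the relevant setting is fs.default.name in
conf/core-site.xml, which I believe selects the default filesystem that
unqualified paths resolve against (the hostname and port below are
placeholders, not from either example):

```xml
<configuration>
  <property>
    <!-- Placeholder value; when set to an hdfs:// URI, unqualified
         paths should resolve to HDFS rather than the local disk. -->
    <name>fs.default.name</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>
```

Please correct me if this is not the setting that explains the difference.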
Thanks a lot!