Hi Tejas,

Thanks a lot for your response. Now I understand why the WordCount example
reads its paths as HDFS paths: the job is launched with the `hadoop` command,
and the Hadoop configuration (core-site.xml) says:
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>
...
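
If I understand it right, it is this property that makes the WordCount
argument paths resolve against HDFS. Here is a tiny sketch of how I picture
the resolution (the class name WhichFs is mine, and the input path is just
the one from the tutorial):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhichFs {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath, the same way the
        // `hadoop` command launcher does.
        Configuration conf = new Configuration();
        System.out.println(conf.get("fs.default.name"));
        // -> hdfs://localhost:9000 with the core-site.xml above

        // An unqualified path is then resolved against that default filesystem:
        Path input = new Path("/usr/joe/wordcount/input");
        FileSystem fs = input.getFileSystem(conf);
        System.out.println(fs.makeQualified(input));
        // -> hdfs://localhost:9000/usr/joe/wordcount/input
    }
}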

However, Nutch 1.7 can be installed without Hadoop preinstalled. Where does
Nutch read the filesystem configuration from? There is no core-site.xml in a
plain Nutch install, is there? Does it then default to the local filesystem?
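
My current guess, and please correct me if I am wrong: since there is no
core-site.xml on the classpath in a plain local install, fs.default.name
keeps its built-in default of "file:///", so the segment paths are resolved
on the local disk. Something like this sketch (the segment path is made up)
is how I imagine checking it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.util.NutchConfiguration;

public class NutchFsCheck {
    public static void main(String[] args) throws Exception {
        // Loads nutch-default.xml / nutch-site.xml on top of the Hadoop
        // defaults; with no core-site.xml around, fs.default.name stays
        // at its default "file:///".
        Configuration conf = NutchConfiguration.create();
        System.out.println(conf.get("fs.default.name"));  // file:///

        // So a segment path like the one ParseSegment receives resolves locally:
        Path segment = new Path("crawl/segments/20140102");
        FileSystem fs = segment.getFileSystem(conf);
        System.out.println(fs.makeQualified(segment));
        // -> file:/<working dir>/crawl/segments/20140102
    }
}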

/usr/bin




On Thu, Jan 2, 2014 at 10:02 PM, Tejas Patil <[email protected]> wrote:

> The config 'fs.default.name' in core-site.xml is what makes this happen.
> Its default value is "file:///", which corresponds to the local mode of
> Hadoop: in local mode, Hadoop looks for paths on the local file system. In
> distributed mode, 'fs.default.name' would be "hdfs://IP_OF_NAMENODE/" and
> Hadoop will look for those paths in HDFS.
>
> Thanks,
> Tejas
>
>
> On Thu, Jan 2, 2014 at 7:28 PM, Bin Wang <[email protected]> wrote:
>
>> Hi there,
>>
>> While going through the source code of Nutch, I looked at the ParseSegment
>> class, whose job is to "parse content in a segment". Here is its MapReduce
>> job configuration:
>>
>> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
>> (lines 199 - 213)
>>
>> 199 JobConf job = new NutchJob(getConf());
>> 200 job.setJobName("parse " + segment);
>> 201
>> 202 FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
>> 203 job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
>> 204 job.setInputFormat(SequenceFileInputFormat.class);
>> 205 job.setMapperClass(ParseSegment.class);
>> 206 job.setReducerClass(ParseSegment.class);
>> 207
>> 208 FileOutputFormat.setOutputPath(job, segment);
>> 209 job.setOutputFormat(ParseOutputFormat.class);
>> 210 job.setOutputKeyClass(Text.class);
>> 211 job.setOutputValueClass(ParseImpl.class);
>> 212
>> 213 JobClient.runJob(job);
>>
>> Here, in lines 202 and 208, the MapReduce input/output paths are configured
>> by calling FileInputFormat.addInputPath / FileOutputFormat.setOutputPath,
>> and the segment path passed in is an absolute path on the local Linux
>> filesystem rather than an HDFS path.
>>
>> On the other hand, here is the WordCount example from the Hadoop MapReduce
>> tutorial:
>> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (lines 39 - 55)
>>
>> 39.     JobConf conf = new JobConf(WordCount.class);
>> 40.     conf.setJobName("wordcount");
>> 41.
>> 42.     conf.setOutputKeyClass(Text.class);
>> 43.     conf.setOutputValueClass(IntWritable.class);
>> 44.
>> 45.     conf.setMapperClass(Map.class);
>> 46.     conf.setCombinerClass(Reduce.class);
>> 47.     conf.setReducerClass(Reduce.class);
>> 48.
>> 49.     conf.setInputFormat(TextInputFormat.class);
>> 50.     conf.setOutputFormat(TextOutputFormat.class);
>> 51.
>> 52.     FileInputFormat.setInputPaths(conf, new Path(args[0]));
>> 53.     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>> 54.
>> 55.     JobClient.runJob(conf);
>>
>> Here, the input/output paths are configured in the same way as in Nutch,
>> except that the paths come in as command-line arguments:
>>
>> bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
>> /usr/joe/wordcount/input /usr/joe/wordcount/output
>>
>> and the paths passed to the program are HDFS paths, not local Linux paths.
>> What confuses me is: is there some other configuration I missed that leads
>> to this difference in runtime environment? And in which case should I pass
>> an absolute local path versus an HDFS path?
>>
>> Thanks a lot!
>>
>> /usr/bin
>>
>>
>
