Here's the short answer. :)

Configuration has two levels: default and final. It is supplied by the org.apache.hadoop.conf.Configuration class and extended in Nutch by the org.apache.nutch.util.NutchConfiguration class.

Although this is configurable, by default hadoop-default.xml and nutch-default.xml are default resources, and hadoop-site.xml and nutch-site.xml are final resources. Resources (i.e. resource files) can be added by filename to either the default or the final resource set; in fact, this is how Nutch extends the Configuration class, by adding nutch-default.xml and nutch-site.xml.

Final resource values override default resource values, and final resource values added later override final resource values added earlier. When I say values I am talking about individual properties, not the resource files. Resource files are found by name on the classpath, with HADOOP_CONF_DIR or NUTCH_CONF_DIR configured in the nutch and hadoop scripts as the first entry on the classpath. You can change the conf dir to pull configuration files from different directories, and many tools in Nutch and Hadoop now provide a -conf option on the command line to set the conf directory.
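The layering described above can be sketched as a toy model. This is illustration only, not the actual Hadoop Configuration code, and the property values below are made up:

```python
# Toy model of configuration layering: default resources are read first,
# then final resources in the order they were added. Later resources win
# over earlier ones, property by property -- files are never replaced whole.
def merge_resources(default_resources, final_resources):
    """Each resource is a dict of property name -> value."""
    merged = {}
    for res in default_resources + final_resources:
        merged.update(res)  # later resource wins, per individual property
    return merged

# Hypothetical property values, for illustration only:
nutch_default = {"plugin.folders": "plugins", "http.agent.name": ""}
hadoop_site = {"plugin.folders": "/opt/plugins"}   # final, added first
nutch_site = {"http.agent.name": "MyCrawler"}      # final, added last

conf = merge_resources([nutch_default], [hadoop_site, nutch_site])
# plugin.folders now comes from hadoop-site.xml (overrides the default),
# while http.agent.name comes from nutch-site.xml.
```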

So for your example: if you define the property in hadoop-default.xml or nutch-default.xml and it is not defined in either hadoop-site.xml or nutch-site.xml, then the default value will stand. If you define the property in either nutch-site.xml or hadoop-site.xml, it will override the nutch-default.xml and hadoop-default.xml settings. And if you define it in both hadoop-site.xml and nutch-site.xml, then nutch-site.xml will override the hadoop-site.xml setting, because nutch-site.xml is added after hadoop-site.xml. And remember, only individual properties are overridden, not the entire file.
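For instance, taking the plugin.folders property from this thread: if it were defined in both site files, the nutch-site.xml entry would be the one that takes effect, since nutch-site.xml is added last. The path below is a placeholder:

```xml
<!-- nutch-site.xml: added after hadoop-site.xml, so this value wins -->
<property>
  <name>plugin.folders</name>
  <value>/path/to/my/plugins</value>
</property>
```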

Practically speaking, you should define properties having to do with Hadoop (i.e. the DFS, MapReduce, etc.) in hadoop-site.xml and properties having to do with Nutch (i.e. fetcher, url-normalizers, etc.) in nutch-site.xml.
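As a sketch of that split (the property values here are placeholders, not recommendations):

```xml
<!-- hadoop-site.xml: Hadoop-level settings (DFS, MapReduce, ...) -->
<property>
  <name>fs.default.name</name>
  <value>localhost:9000</value>
</property>

<!-- nutch-site.xml: Nutch-level settings (fetcher, plugins, ...) -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
```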

Dennis Kubes

Ricardo J. Méndez wrote:
Hi Gal,

Thanks for the reply.

What has me wondering is that several other plugins _are_ being loaded
when I define it on hadoop-site.xml, and actually that defining
plugin.folders on that file is the only way I've found so far of getting
plugins loaded at all when testing from Eclipse.

Moreover, I get this problem even if I define it in both nutch-site and
hadoop-site, which would make it seem that the definition in
hadoop-site.xml does have an effect.  I was assuming they overrode the
options from nutch-site.xml - am I mistaken?


Ricardo J. Méndez
http://ricardo.strangevistas.net/

Gal Nitzan wrote:
Hi,

Nutch loads its configuration from nutch-site.xml and nutch-default.xml and not
from the hadoop conf files, so the behavior is correct.

HTH,

Gal.


On 3/1/07, "Ricardo J. Méndez" <[EMAIL PROTECTED]> wrote:
Hi,

I'm using nutch-0.9, from the trunk. I've noticed a behavior
difference in a plugin unit test if I set the plugin.folders property in
nutch-site.xml vs. hadoop-site.xml. If I set it in nutch-site.xml, the
unit test works well, but an error is raised if it's in hadoop-site.xml.

The error is:

   [junit]  WARN [main] (ParserFactory.java:196) - Canno initialize
parser parse-html (cause:
org.apache.nutch.plugin.PluginRuntimeException:
java.lang.ClassNotFoundException: org.apache.nutch.parse.html.HtmlParser


Is there a reason why the HtmlParser wouldn't be loaded when the
directory is specified on hadoop-site.xml?

Thanks in advance,




Ricardo J. Méndez
http://ricardo.strangevistas.net/
