Dennis Kubes wrote:
> Practically you should define properties having to do with Hadoop 
> (i.e. the DFS, Mapreduce, etc) in the hadoop-site.xml and properties 
> having to do with Nutch (i.e. fetcher, url-normalizers, etc) in the 
> nutch-site.xml.

There are two other important config files:

* mapred-default.xml - this is loaded as a default resource when a new 
map-reduce JobConf is created, which means that it is loaded as the 
last default resource when you prepare the job configuration. Usually 
you should keep its content to a bare minimum. This is the best place to 
specify the default number of map and reduce tasks per job. If you feel 
adventurous you could also put some other settings there, e.g. set the 
default compression with mapred.compress.map.output.

* job.xml - this file is created dynamically, and represents a 
serialized JobConf. When map-reduce tasks are started they read this 
file as their last default resource (note - this is NOT a final 
resource!). So, if you accidentally distributed mapred-default.xml to 
all cluster nodes, but in your job you specified a different number of 
map or reduce tasks, your job settings will take precedence. The same 
goes for other settings, such as compression.
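
As a rough illustration, a minimal mapred-default.xml along these lines 
might look like the sketch below. The task counts are made-up values, 
and the property names mapred.map.tasks and mapred.reduce.tasks are my 
assumption about what your Hadoop version uses, so check them against 
your hadoop-default.xml. Only mapred.compress.map.output is named above.

```xml
<?xml version="1.0"?>
<!-- mapred-default.xml: job-level defaults that a job may override. -->
<configuration>

  <!-- Default number of map tasks per job (hypothetical value;
       property name assumed). -->
  <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
  </property>

  <!-- Default number of reduce tasks per job (hypothetical value;
       property name assumed). -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>

  <!-- Optional: compress map output by default. -->
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>

</configuration>
```

Because this file and job.xml are both default resources, a job that 
sets its own number of reduce tasks would still get its own value.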

HOWEVER ... a common error is to put job-level properties, such as the 
default number of map and reduce tasks, in hadoop-site.xml. As Dennis 
explained, this is a final resource - which means that the values you 
specify there will ALWAYS override your job settings. This is bad, so 
don't do it ;) - put them in mapred-default.xml instead.

In other words: use hadoop-site.xml only for things that are always the 
same for the whole cluster, such as the FS name, jobtracker name:port, 
temp directories, etc., because values you put there will always 
override your job settings. Do not put job-dependent settings there.
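
For comparison, a cluster-wide hadoop-site.xml restricted to such 
settings could be sketched as follows. The host names, ports and paths 
are placeholders, and the property names (fs.default.name, 
mapred.job.tracker, mapred.local.dir) are what I assume your Hadoop 
version uses, so verify them against your hadoop-default.xml.

```xml
<?xml version="1.0"?>
<!-- hadoop-site.xml: final resource, the same on every node.
     Values here always override job settings. -->
<configuration>

  <!-- FS name (hypothetical host:port; property name assumed). -->
  <property>
    <name>fs.default.name</name>
    <value>namenode.example.com:9000</value>
  </property>

  <!-- Jobtracker name:port (hypothetical; property name assumed). -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>

  <!-- Local temp directory for map-reduce data (hypothetical path;
       property name assumed). -->
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>

  <!-- Note: no map/reduce task counts or compression settings here. -->

</configuration>
```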

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
