Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

The comment on the change is:
Corrected some misunderstanding about the config files.

------------------------------------------------------------------------------
  }}}
  
  === Configure Hadoop ===
+ Edit the mapred-default.xml configuration file. If it's missing, create it, 
with the following content:
+ {{{
+ <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+ 
+ <configuration>
+ 
+ <property> 
+   <name>mapred.map.tasks</name>
+   <value>2</value>
+   <description>
+     This should be a prime number several times larger than the number of
+     slave hosts, e.g. for 3 nodes set this to 17
+   </description> 
+ </property> 
+ 
+ <property> 
+   <name>mapred.reduce.tasks</name>
+   <value>2</value>
+   <description>
+     This should be a prime number close to a low multiple of the number of
+     slave hosts, e.g. for 3 nodes set this to 7
+   </description> 
+ </property> 
+ 
+ </configuration>
+ }}}
+ 
+ ''Note: do NOT put these properties in hadoop-site.xml. That file (see below) 
should contain only properties specific to the cluster, not to the 
job. Misplacing these properties leads to strange, hard-to-debug 
problems, e.g. the inability to specify the number of map/reduce tasks 
programmatically (Generator and Fetcher depend on this).''
+ 
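The "prime number near a multiple of the slave hosts" rule in the descriptions above can be sketched as a small helper. This is only an illustration, not part of Nutch or Hadoop: the function names are hypothetical, and the multipliers 5 (maps) and 2 (reduces) are assumptions chosen because they reproduce the examples given for 3 nodes (17 and 7).

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check; adequate for small task counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime_at_or_above(n: int) -> int:
    """Return the smallest prime that is >= n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


def suggest_task_counts(slave_hosts: int,
                        map_multiple: int = 5,      # assumed multiplier
                        reduce_multiple: int = 2):  # assumed multiplier
    """Suggest (mapred.map.tasks, mapred.reduce.tasks) for a cluster size."""
    maps = next_prime_at_or_above(slave_hosts * map_multiple)
    reduces = next_prime_at_or_above(slave_hosts * reduce_multiple)
    return maps, reduces


print(suggest_task_counts(3))  # (17, 7)
```

The resulting values would go into mapred-default.xml, not hadoop-site.xml, in line with the note above.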
  Edit the hadoop-site.xml configuration file.
  {{{
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
@@ -115, +144 @@

    </description>
  </property>
  
- <property> 
+ <property>
-   <name>mapred.map.tasks</name>
+   <name>mapred.tasktracker.tasks.maximum</name>
    <value>2</value>
    <description>
-     define mapred.map tasks to be number of slave hosts
+     The maximum number of tasks that will be run simultaneously by
+     a task tracker. This should be adjusted according to the heap size
+     per task, the amount of RAM available, and CPU consumption of each task.
-   </description> 
+   </description>
- </property> 
+ </property>
  
- <property> 
+ <property>
-   <name>mapred.reduce.tasks</name>
+   <name>mapred.child.java.opts</name>
-   <value>2</value>
+   <value>-Xmx200m</value>
    <description>
-     define mapred.reduce tasks to be number of slave hosts
+     You can specify other Java options for each map or reduce task here,
+     but most likely you will want to adjust the heap size.
-   </description> 
+   </description>
- </property> 
+ </property>
  
  <property>
    <name>dfs.name.dir</name>
@@ -462, +494 @@

  
  = Comments =
  == Number of map reduce tasks ==
- I noticed that the number of map and reduce task has an impact on the 
performance of Hadoop. Many times after crawling a lot of pages the nodes 
reported 'java.lang.OutOfMemoryError: Java heap space' errors, this happend 
also in the indexing part. Increasing the number of maps solved these problems, 
with an index that has over 200.000 pages I needed 306 maps in total over 3 
machines. By setting the mapred.maps.tasks property in hadoop-site.xml to 99 
(much higher than what is advised in other tutorials and in the hadoop-site.xml 
file) that problem is solved.
+ I noticed that the number of map and reduce tasks has an impact on the 
performance of Hadoop. Often, after crawling a lot of pages, the nodes 
reported 'java.lang.OutOfMemoryError: Java heap space' errors; this also 
happened in the indexing part. Increasing the number of maps solved these 
problems: for an index of over 200,000 pages I needed 306 maps in total across 3 
machines. Setting the mapred.map.tasks property in hadoop-site.xml to 99 
(much higher than what is advised in other tutorials and in the hadoop-site.xml 
file) solved that problem.
+ 
+ ''(ab) As noted above, do NOT set the number of map/reduce tasks in 
hadoop-site.xml; put them in mapred-default.xml instead.''
  
  See [http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces] for more 
info about the number of map reduce tasks.
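
The hadoop-site.xml change above says mapred.tasktracker.tasks.maximum should be adjusted to the heap size per task and the available RAM. A rough way to check that budget is to multiply the two settings. The sketch below is an illustration only: the function names are hypothetical, it counts only the -Xmx heap limit (real JVM processes use more memory than that), and it assumes the -Xmx value carries a k, m or g suffix.

```python
import re


def parse_xmx_mb(java_opts: str) -> int:
    """Extract the -Xmx heap limit from a Java options string, in megabytes."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_opts)
    if match is None:
        raise ValueError("no -Xmx option with a k/m/g suffix found")
    size = int(match.group(1))
    unit = match.group(2).lower()
    if unit == "k":
        return size // 1024
    if unit == "g":
        return size * 1024
    return size  # already in megabytes


def worst_case_heap_mb(tasks_maximum: int, java_opts: str) -> int:
    """Heap needed if every simultaneous task slot fills its heap."""
    return tasks_maximum * parse_xmx_mb(java_opts)


# Values taken from the hadoop-site.xml example in this change.
print(worst_case_heap_mb(2, "-Xmx200m"))  # 400
```

If the result comes close to the RAM available on a node, either the task maximum or the heap size per task would need to come down.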
  

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs