Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by AndrzejBialecki:
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial

The comment on the change is:
Corrected some misunderstandings about the config files.

------------------------------------------------------------------------------
  }}}
  === Configure Hadoop ===
+ Edit the mapred-default.xml configuration file. If it is missing, create it with the following content:
+ {{{
+ <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+
+ <configuration>
+
+ <property>
+   <name>mapred.map.tasks</name>
+   <value>2</value>
+   <description>
+     This should be a prime number larger than a multiple of the number of slave hosts,
+     e.g. for 3 nodes set this to 17.
+   </description>
+ </property>
+
+ <property>
+   <name>mapred.reduce.tasks</name>
+   <value>2</value>
+   <description>
+     This should be a prime number close to a low multiple of the number of slave hosts,
+     e.g. for 3 nodes set this to 7.
+   </description>
+ </property>
+
+ </configuration>
+ }}}
+
+ ''Note: do NOT put these properties in hadoop-site.xml. That file (see below) should contain only properties characteristic of the cluster, not of the job. Misplacing these properties leads to strange, hard-to-debug problems, e.g. the inability to set the number of map/reduce tasks programmatically (Generator and Fetcher depend on this).''
+
  Edit the hadoop-site.xml configuration file.
  {{{
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
@@ -115, +144 @@
    </description>
  </property>

- <property>
+ <property>
-   <name>mapred.map.tasks</name>
+   <name>mapred.tasktracker.tasks.maximum</name>
    <value>2</value>
    <description>
-     define mapred.map tasks to be number of slave hosts
+     The maximum number of tasks that will be run simultaneously by
+     a task tracker. This should be adjusted according to the heap size
+     per task, the amount of RAM available, and the CPU consumption of each task.
-   </description>
+   </description>
- </property>
+ </property>

- <property>
+ <property>
-   <name>mapred.reduce.tasks</name>
+   <name>mapred.child.java.opts</name>
-   <value>2</value>
+   <value>-Xmx200m</value>
    <description>
-     define mapred.reduce tasks to be number of slave hosts
+     You can specify other Java options for each map or reduce task here,
+     but most likely you will want to adjust the heap size.
-   </description>
+   </description>
- </property>
+ </property>

  <property>
    <name>dfs.name.dir</name>
@@ -462, +494 @@

  = Comments =

  == Number of map reduce tasks ==
- I noticed that the number of map and reduce task has an impact on the performance of Hadoop. Many times after crawling a lot of pages the nodes reported 'java.lang.OutOfMemoryError: Java heap space' errors, this happend also in the indexing part. Increasing the number of maps solved these problems, with an index that has over 200.000 pages I needed 306 maps in total over 3 machines. By setting the mapred.maps.tasks property in hadoop-site.xml to 99 (much higher than what is advised in other tutorials and in the hadoop-site.xml file) that problem is solved.
+ I noticed that the number of map and reduce tasks has an impact on the performance of Hadoop. Many times, after crawling a lot of pages, the nodes reported 'java.lang.OutOfMemoryError: Java heap space' errors; this also happened in the indexing part. Increasing the number of maps solved these problems: with an index of over 200,000 pages I needed 306 maps in total over 3 machines. Setting the mapred.map.tasks property in hadoop-site.xml to 99 (much higher than what is advised in other tutorials and in the hadoop-site.xml file) solved that problem.
+
+ ''(ab) As noted above, do NOT set the number of map/reduce tasks in hadoop-site.xml; put them in mapred-default.xml instead.''

  See [http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces] for more info about the number of map reduce tasks.
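The prime-number rule of thumb in the descriptions above (for 3 nodes: 17 map tasks, 7 reduce tasks) can be sketched as a small helper. This is an illustration only, not part of the tutorial: the multipliers 5 and 2 are assumptions chosen so that the 3-node example works out, not values prescribed by Hadoop.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for small task counts."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n: int) -> int:
    """Smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate

def suggest_task_counts(slave_hosts: int,
                        map_factor: int = 5,
                        reduce_factor: int = 2) -> tuple:
    """Suggest (mapred.map.tasks, mapred.reduce.tasks) per the tutorial's
    guidance: a prime above a multiple of the node count for maps, and a
    prime near a low multiple for reduces. map_factor/reduce_factor are
    hypothetical knobs picked to reproduce the 3-node example (17, 7)."""
    return (next_prime(slave_hosts * map_factor),
            next_prime(slave_hosts * reduce_factor))

print(suggest_task_counts(3))  # (17, 7), matching the tutorial's 3-node example
```

The values would then go into mapred-default.xml by hand; Hadoop 0.10 does not read them from anywhere else.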
_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs