Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Gal Nitzan:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  If you're fetching regularly, segments older than the db.default.fetch.interval can be deleted, as their pages should have been refetched. This is 30 days by default.

+ === MapReduce ===
+
+ ==== How to start working with MapReduce? ====
+
+ Edit conf/nutch-site.xml:
+
+ <property>
+   <name>fs.default.name</name>
+   <value>localhost:9000</value>
+   <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
+ </property>
+
+ <property>
+   <name>mapred.job.tracker</name>
+   <value>localhost:9001</value>
+   <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
+ </property>
+
+ Edit conf/mapred-default.xml:
+
+ <property>
+   <name>mapred.map.tasks</name>
+   <value>4</value>
+   <description>Set mapred.map.tasks to a multiple of the number of slave hosts.</description>
+ </property>
+
+ <property>
+   <name>mapred.reduce.tasks</name>
+   <value>2</value>
+   <description>Set mapred.reduce.tasks to the number of slave hosts.</description>
+ </property>
+
+ Create a file with the slave host names:
+
+ {{{
+ % echo localhost >> ~/.slaves
+ % echo somemachine >> ~/.slaves
+ }}}
+
+ Start all NDFS and MapReduce daemons:
+ {{{
+ % bin/start-all.sh
+ }}}
+
+ Create a directory containing a seed list file:
+ {{{
+ % mkdir seeds
+ % echo http://www.cnn.com/ > seeds/urls
+ }}}
+
+ Put the seed directory into NDFS:
+ {{{
+ % bin/nutch ndfs -put seeds seeds
+ }}}
+
+ Crawl a bit:
+ {{{
+ % bin/nutch crawl seeds -depth 3
+ }}}
+
+ Monitor progress from the administrative interface: open a browser and go to masterHost:7845, where masterHost is the name of your master host.
+
  === Searching ===

  ==== Common words are saturating my search results. ====
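As a worked example of the task-count guidance in the MapReduce change above (map tasks a multiple of the number of slave hosts, reduce tasks equal to the number of slave hosts), the following sketch derives both values from a slaves file. The function name `suggest_tasks` and the default multiplier of 2 are assumptions made for illustration; the page itself only says "a multiple" and is not part of Nutch.

```shell
# Hypothetical helper, not part of Nutch: suggest task counts from a
# slaves file that lists one host name per line.
suggest_tasks() {
  slaves_file="$1"
  multiple="${2:-2}"   # assumed multiplier; the FAQ only says "a multiple"

  # Count the non-empty lines, i.e. the number of slave hosts.
  hosts=$(grep -c . "$slaves_file" || true)

  echo "mapred.map.tasks=$((hosts * multiple))"
  echo "mapred.reduce.tasks=$hosts"
}
```

Running `suggest_tasks ~/.slaves` against a slaves file with two hosts prints `mapred.map.tasks=4` and `mapred.reduce.tasks=2`, which matches the values used in conf/mapred-default.xml above.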
