Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by Gal Nitzan:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  
  If you're fetching regularly, segments older than the 
db.default.fetch.interval can be deleted, as their pages should have been 
refetched. This is 30 days by default.
  
+ === MapReduce ===
+ 
+ ==== How to start working with MapReduce? ====
+ 
+ Edit conf/nutch-site.xml:
+ 
+   <property>
+     <name>fs.default.name</name>
+     <value>localhost:9000</value>
+     <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
+   </property>
+ 
+   <property>
+     <name>mapred.job.tracker</name>
+     <value>localhost:9001</value>
+     <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
+   </property>
+ 
+ 
+ Edit conf/mapred-default.xml:
+   <property>
+     <name>mapred.map.tasks</name>
+     <value>4</value>
+     <description>Set mapred.map.tasks to a multiple of the number of slave hosts.</description>
+   </property>
+ 
+   <property>
+     <name>mapred.reduce.tasks</name>
+     <value>2</value>
+     <description>Set mapred.reduce.tasks to the number of slave hosts.</description>
+   </property>
+ 
+ Create a file listing the slave host names:
+ 
+ {{{
+ % echo localhost >> ~/.slaves
+ % echo somemachine >> ~/.slaves
+ }}}
+ 
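The descriptions in conf/mapred-default.xml tie both task counts to the number of slave hosts, so the values can be derived from ~/.slaves once it exists. A minimal sketch, assuming one host name per line; MAPS_PER_HOST, count_hosts and suggest_tasks are illustrative names, not Nutch settings or commands, and the default multiplier of 2 is only an assumption chosen to match the sample values above.

```shell
#!/bin/sh
# Sketch: suggest mapred task counts from the slaves file.
# SLAVES_FILE and MAPS_PER_HOST are hypothetical variables for this example.
SLAVES_FILE="${SLAVES_FILE:-$HOME/.slaves}"
MAPS_PER_HOST="${MAPS_PER_HOST:-2}"

# Count the non-blank lines (one slave host per line is assumed).
count_hosts() {
  grep -c '[^[:space:]]' "$1" || true
}

# Print suggested values: map tasks are a multiple of the host count,
# reduce tasks equal the host count.
suggest_tasks() {
  hosts=$(count_hosts "$1")
  echo "mapred.map.tasks=$((hosts * $2))"
  echo "mapred.reduce.tasks=$hosts"
}

# Only report when the slaves file is actually present.
if [ -f "$SLAVES_FILE" ]; then
  suggest_tasks "$SLAVES_FILE" "$MAPS_PER_HOST"
fi
```

With the two-host file created above and the default multiplier, this prints mapred.map.tasks=4 and mapred.reduce.tasks=2. The values still have to be copied into conf/mapred-default.xml by hand.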
+ Start all NDFS and MapReduce daemons:
+ {{{
+ % bin/start-all.sh
+ }}}
+ 
+ Create a directory containing a seed list file:
+ {{{
+ % mkdir seeds
+ % echo http://www.cnn.com/ > seeds/urls
+ }}}
+ 
+ Put the seed directory into NDFS:
+ {{{
+ % bin/nutch ndfs -put seeds seeds
+ }}}
+ 
+ Crawl a bit:
+ {{{
+ % bin/nutch crawl seeds -depth 3
+ }}}
+ 
+ Monitor progress from the administrative interface:
+ open a browser and go to masterHost:7845, replacing masterHost with the name of your master host.
+ 
+ 
  === Searching ===
  
  ==== Common words are saturating my search results. ====
