Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "WhiteListRobots" page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=2&rev2=3

  
  Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list for robots.txt]] capability that can be used to turn robots.txt parsing on or off selectively, on a per-host and/or per-IP basis. Read on to find out how to use it.
  
- = List hostnames and/or IP addresses in Nutch conf = 
+ == List hostnames and/or IP addresses in Nutch conf ==
  
  In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or 
nutch-site.xml) and add the following information:
  
@@ -28, +28 @@

  </property>
  }}}
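  
  For reference, a complete property block might look like the following sketch (assuming the property name {{{http.robot.rules.whitelist}}} introduced by NUTCH-1927; the hostnames in the value are placeholders):
  
  {{{
  <!-- Hypothetical example: skip robots.txt enforcement for two hosts -->
  <property>
    <name>http.robot.rules.whitelist</name>
    <value>somehost.example.com,127.0.0.1</value>
    <description>Comma-separated list of hostnames or IP addresses
    for which robots.txt parsing is skipped. Use with care, and only
    for hosts you are explicitly allowed to crawl without honoring
    their robots.txt.</description>
  </property>
  }}}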
  
- = Testing the configuration =
+ == Testing the configuration ==
  
  Create a sample URL file to test your whitelist. For example, create a file named "url" (without the quotes) and put each URL on its own line:
@@ -44, +44 @@

  Disallow: /
  }}}
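  
  The URL file described above might look like the following minimal sketch (the hostnames are hypothetical):
  
  {{{
  http://somehost.example.com/
  http://anotherhost.example.com/page.html
  }}}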
  
- = Build the Nutch runtime and execute RobotRulesParser =
+ == Build the Nutch runtime and execute RobotRulesParser ==
  
  Now, build the Nutch runtime, e.g., by running {{{ant runtime}}}. From your {{{runtime/local/}}} directory, run this command:
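  
  A minimal sketch of that command (assuming RobotRulesParser's command-line usage of <robots-file> <url-file> <agent-name>; the file and agent names are placeholders):
  
  {{{
  bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt url "MyTestBot"
  }}}
  
  For hosts on the whitelist, the parser should report the URLs as allowed even though the sample robots.txt above disallows everything; URLs on hosts not in the list should be denied.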
