Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "WhiteListRobots" page has been changed by ChrisMattmann: https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=2&rev2=3 Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list for robots.txt]] capability that can be used to selectively on a per host and/or IP basis turn on/off robots.txt parsing. Read on to find out how to use it. - = List hostnames and/or IP addresses in Nutch conf = + == List hostnames and/or IP addresses in Nutch conf == In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or nutch-site.xml) and add the following information: @@ -28, +28 @@ </property> }}} - = Testing the configuration = + == Testing the configuration == Create a sample URLs file to test your whitelist. For example, create a file, call it "url" (without the quotes) and store each URL on a line: @@ -44, +44 @@ Disallow: / }}} - = Build the Nutch runtime and execute RobotRulesParser = + == Build the Nutch runtime and execute RobotRulesParser == Now, build the Nutch runtime, e.g., by running ```ant runtime```. From your ```runtime/local/```` directory, run this command:
@@ -28, +28 @@

  </property>
  }}}

- = Testing the configuration =
+ == Testing the configuration ==

Create a sample URLs file to test your whitelist. For example, create a file called "url" (without the quotes) and store each URL on its own line:
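The URL list and the sample robots.txt are elided in the diff; only the tail of the robots.txt (the Disallow: / context in the next hunk) survives. A sketch with placeholder URLs:

{{{
http://www.example.com/page1.html
http://www.example.com/page2.html
}}}

and a robots.txt that disallows everything, consistent with that Disallow: / tail, so that only URLs on whitelisted hosts will come back as allowed:

{{{
User-agent: *
Disallow: /
}}}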
@@ -44, +44 @@

  Disallow: /
  }}}

- = Build the Nutch runtime and execute RobotRulesParser =
+ == Build the Nutch runtime and execute RobotRulesParser ==

Now build the Nutch runtime, e.g. by running `ant runtime`. From your `runtime/local/` directory, run this command:
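The command itself is truncated in this notification. A plausible invocation, assuming RobotRulesParser takes a robots.txt file, a URL file, and an agent name, and that bin/nutch runs a class given its fully qualified name; the agent name 'MyNutchBot' is a placeholder:

{{{
bin/nutch org.apache.nutch.protocol.RobotRulesParser robots.txt url 'MyNutchBot'
}}}

With the whitelist in effect, URLs on whitelisted hosts should be reported as allowed even though the robots.txt above disallows everything; URLs on any other host should be reported as forbidden.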