Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "WhiteListRobots" page has been changed by ChrisMattmann: https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=1&rev2=2 Comment: - Add example docs + = White List for Robots.txt = + Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list for robots.txt]] capability that can be used to selectively on a per host and/or IP basis turn on/off robots.txt parsing. Read on to find out how to use it. + = List hostnames and/or IP addresses in Nutch conf = + + In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or nutch-site.xml) and add the following information: + + {{{ + <property> + <name>robot.rules.whitelist</name> + <value></value> + <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for. + </description> + </property> + }}} + + For example, try this, to whitelist the host, baron.pagemewhen.com: + + {{{ + <property> + <name>robot.rules.whitelist</name> + <value>baron.pagemewhen.com</value> + <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for. + </description> + </property> + }}} + + = Testing the configuration = + + Create a sample URLs file to test your whitelist. For example, create a file, call it "url" (without the quotes) and store each URL on a line: + + {{{ + http://baron.pagemewhen.com/~chris/foo1.txt + http://baron.pagemewhen.com/~chris/ + }}} + + Create a sample robots.txt file, e.g., "robots.txt" (without the quotes): + + {{{ + User-agent: * + Disallow: / + }}} + + = Build the Nutch runtime and execute RobotRulesParser = + + Now, build the Nutch runtime, e.g., by running ```ant runtime```. + From your ```runtime/local/```` directory, run this command: + + {{{ + java -cp build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler + }}} + + You should see the following output: + + {{{ + Robots: whitelist: [baron.pagemewhen.com] + Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed + INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored + allowed: http://baron.pagemewhen.com/~chris/foo1.txt + Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed + INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored + allowed: http://baron.pagemewhen.com/~chris/ + }}} +

