Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "WhiteListRobots" page has been changed by ChrisMattmann: https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=1&rev2=2 Comment: - Add example docs + = White List for Robots.txt = + Nutch now has a [[https://issues.apache.org/jira/browse/NUTCH-1927|white list for robots.txt]] capability that can be used to selectively on a per host and/or IP basis turn on/off robots.txt parsing. Read on to find out how to use it. + = List hostnames and/or IP addresses in Nutch conf = + + In the Nutch configuration directory (conf/), edit nutch-default.xml (and/or nutch-site.xml) and add the following information: + + {{{ + <property> + <name>robot.rules.whitelist</name> + <value></value> + <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for. + </description> + </property> + }}} + + For example, try this, to whitelist the host, baron.pagemewhen.com: + + {{{ + <property> + <name>robot.rules.whitelist</name> + <value>baron.pagemewhen.com</value> + <description>Comma separated list of hostnames or IP addresses to ignore robot rules parsing for. + </description> + </property> + }}} + + = Testing the configuration = + + Create a sample URLs file to test your whitelist. For example, create a file, call it "url" (without the quotes) and store each URL on a line: + + {{{ + http://baron.pagemewhen.com/~chris/foo1.txt + http://baron.pagemewhen.com/~chris/ + }}} + + Create a sample robots.txt file, e.g., "robots.txt" (without the quotes): + + {{{ + User-agent: * + Disallow: / + }}} + + = Build the Nutch runtime and execute RobotRulesParser = + + Now, build the Nutch runtime, e.g., by running ```ant runtime```. + From your ```runtime/local/```` directory, run this command: + + {{{ + java -cp build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler + }}} + + You should see the following output: + + {{{ + Robots: whitelist: [baron.pagemewhen.com] + Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed + INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored + allowed: http://baron.pagemewhen.com/~chris/foo1.txt + Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed + INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored + allowed: http://baron.pagemewhen.com/~chris/ + }}} +

