Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "WhiteListRobots" page has been changed by ChrisMattmann:
https://wiki.apache.org/nutch/WhiteListRobots?action=diff&rev1=3&rev2=4

Comment:
- documentation update

  From your `runtime/local/` directory, run this command:
  
  {{{
- java -cp build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.1.1.jar org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler
+ java -cp build/apache-nutch-1.10-SNAPSHOT.job:build/apache-nutch-1.10-SNAPSHOT.jar:runtime/local/lib/hadoop-core-1.2.0.jar:runtime/local/lib/crawler-commons-0.5.jar:runtime/local/lib/slf4j-log4j12-1.6.1.jar:runtime/local/lib/slf4j-api-1.7.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/guava-11.0.2.jar:runtime/local/lib/commons-logging-1.2.jar:runtime/local/lib/commons-cli-1.2.jar org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler
  }}}
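
The classpath in the updated command is long enough that a single mistyped jar name is easy to miss. As a rough sketch only, it could be assembled from its parts as below. The variable names (`NUTCH_VERSION`, `LIB`, `CP`) are hypothetical and not part of Nutch; the jar names, versions, and relative paths are copied from the command above and assume the same working directory. The script only prints the resulting command so it can be inspected before it is run.

```shell
#!/bin/sh
# Hypothetical helper: rebuild the classpath used in the command above.
# Jar names and versions are taken from that command; adjust them to your build.
NUTCH_VERSION="1.10-SNAPSHOT"
LIB="runtime/local/lib"

# Start with the Nutch job and jar files under build/.
CP="build/apache-nutch-${NUTCH_VERSION}.job:build/apache-nutch-${NUTCH_VERSION}.jar"

# Append each dependency jar, in the same order as the original command.
for jar in hadoop-core-1.2.0 crawler-commons-0.5 slf4j-log4j12-1.6.1 \
           slf4j-api-1.7.9 log4j-1.2.15 guava-11.0.2 \
           commons-logging-1.2 commons-cli-1.2; do
  CP="${CP}:${LIB}/${jar}.jar"
done

# Print the command rather than running it, so it can be checked first.
echo "java -cp ${CP} org.apache.nutch.protocol.RobotRulesParser robots.txt urls Nutch-crawler"
```

If the printed line matches the command above, it can be pasted into the shell as-is.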
  
  You should see the following output:
  
  {{{
- Robots: whitelist: [baron.pagemewhen.com]
+ Whitelisted hosts: [baron.pagemewhen.com]
- Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed
- INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored
+ Whitelisted hosts: [baron.pagemewhen.com]
+ Whitelisted hosts: [baron.pagemewhen.com]
- allowed:      http://baron.pagemewhen.com/~chris/foo1.txt
+ whitelisted:  http://baron.pagemewhen.com/~chris/foo1.txt
- Apr 14, 2015 11:53:20 PM org.apache.nutch.protocol.WhiteListRobotRules isWhiteListed
- INFO: Host: [baron.pagemewhen.com] is whitelisted and robots.txt rules parsing will be ignored
- allowed:      http://baron.pagemewhen.com/~chris/
+ whitelisted:  http://baron.pagemewhen.com/~chris/
  }}}
  
