[ 
https://issues.apache.org/jira/browse/NUTCH-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757332#comment-17757332
 ] 

Hudson commented on NUTCH-2996:
-------------------------------

FAILURE: Integrated in Jenkins build Nutch » Nutch-trunk #103 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/103/])
NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 1.4 
(github: 
[https://github.com/apache/nutch/commit/070c115cfadbc937a8ad0add6447461983e92028])
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java
* (edit) 
src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
* (edit) conf/nutch-default.xml


> Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
> --------------------------------------------------------------------
>
>                 Key: NUTCH-2996
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2996
>             Project: Nutch
>          Issue Type: Improvement
>          Components: robots
>    Affects Versions: 1.20
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The robots.txt parser (SimpleRobotRulesParser) of crawler-commons 1.4 (#1085) 
> introduces a new [API entry point to parse the robots.txt 
> content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]:
> - it's more efficient by accepting a collection of lower-cased, single-word 
> user-agent product tokens, without the need to tokenize a (comma-separated) 
> list of user-agent strings again with every robots.txt
> - user-agent matching is compliant with [RFC 9309 (section 
> 2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line] 
> only if the new API method is used
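
To illustrate the efficiency point in the description, here is a minimal sketch of preparing the collection of lower-cased user-agent product tokens once, rather than tokenizing a comma-separated list again for every robots.txt. The class name AgentTokens and the method toProductTokens are hypothetical and are not taken from the Nutch commit; the actual change in RobotRulesParser.java may look different. The sketch does not check that each token is a single word. The parseContent call is shown only as a comment because it requires crawler-commons 1.4 on the classpath; its parameter order follows the Javadoc linked in the description.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class AgentTokens {

    /**
     * Splits a comma-separated list of agent names once, trims and
     * lower-cases every entry, and drops empty entries and duplicates.
     * (Hypothetical helper, for illustration only.)
     */
    static Set<String> toProductTokens(String agentNames) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String name : agentNames.split(",")) {
            String token = name.trim().toLowerCase(Locale.ROOT);
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Computed once, e.g. at configuration time, then reused.
        Set<String> tokens = toProductTokens("MyBot, Other-Bot ,mybot");
        System.out.println(tokens); // prints [mybot, other-bot]

        // With crawler-commons 1.4 available, the same set could be
        // passed to the new entry point for each fetched robots.txt:
        //   new SimpleRobotRulesParser()
        //       .parseContent(url, content, contentType, tokens);
    }
}
```

Using a LinkedHashSet keeps the configured order of the agent names while removing duplicates that differ only in case.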



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
