Sebastian Nagel created NUTCH-2996:
--------------------------------------

             Summary: Use new SimpleRobotRulesParser API entry point 
(crawler-commons 1.4)
                 Key: NUTCH-2996
                 URL: https://issues.apache.org/jira/browse/NUTCH-2996
             Project: Nutch
          Issue Type: Improvement
          Components: robots
    Affects Versions: 1.20
            Reporter: Sebastian Nagel
             Fix For: 1.20


The robots.txt parser (SimpleRobotRulesParser) in crawler-commons 1.4 (#1085) 
introduces a new [API entry point to parse the robots.txt 
content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]:
- it is more efficient because it accepts a collection of lower-cased, single-word 
user-agent product tokens, so a (comma-separated) list of user-agent strings 
no longer has to be tokenized again for every robots.txt
- user-agent matching is compliant with [RFC 9309 (section 
2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line] 
only if the new API method is used
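
To illustrate the efficiency point above, here is a minimal sketch of the one-time preparation step: a comma-separated list of agent names is split into lower-cased tokens once, and the resulting collection can then be reused for every robots.txt. The class AgentTokens, the method toProductTokens and the sample agent names are hypothetical and are not taken from Nutch or crawler-commons. The sketch uses only the JDK and does not call the parser itself.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of Nutch or crawler-commons. */
final class AgentTokens {

  private AgentTokens() {}

  /**
   * Splits a comma-separated list of agent names into lower-cased tokens,
   * dropping empty entries and duplicates while keeping the original order.
   */
  static Set<String> toProductTokens(String commaSeparatedAgents) {
    return Arrays.stream(commaSeparatedAgents.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .map(s -> s.toLowerCase(Locale.ROOT))
        .collect(Collectors.toCollection(LinkedHashSet::new));
  }

  public static void main(String[] args) {
    // Done once (for example at start-up) instead of once per robots.txt.
    Collection<String> robotNames = toProductTokens("Nutch-Test, MyBot ,, GoogleBot");
    System.out.println(robotNames);
    // The same collection would then be passed to each call of the form
    //   parser.parseContent(url, content, contentType, robotNames)
    // (argument types as in the linked Javadoc). The call is not made here
    // so that the sketch has no dependency on crawler-commons.
  }
}
```

The sketch only lower-cases and de-duplicates the names. Whether each name is a valid single-word product token in the sense of RFC 9309 is not checked here and would have to be ensured separately.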




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
