[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

Ken Krugler (JIRA) Sun, 20 Jan 2013 13:56:13 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558400#comment-13558400
 ]


Ken Krugler commented on NUTCH-1031:
------------------------------------

Regarding precedence - my guess is that it's not very important, as I haven't 
seen many (any?) robots.txt files where the same robot would be matched, under 
related names, by multiple rule blocks containing different rules.

This issue of precedence is specific to Nutch users, however (it's not part of 
the robots.txt RFC), so I'd suggest posting to the Nutch users list to see if 
anyone thinks it's important.

As for your review of the CC code, yes, it's correct. There's one additional 
wrinkle: the target user agent name is split on spaces, due to what appears to 
be an implicit expectation that you can use a user agent name containing spaces 
(which, based on the RFC, isn't actually valid) and that any piece of the name 
will match.
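
To illustrate the splitting behavior described above, here is a minimal Java 
sketch. It is not the crawler-commons code: the class AgentTokenMatch and the 
method matchesAnyPiece are hypothetical names, and the exact, case-insensitive 
comparison of each piece is an assumption made only for this example.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of crawler-commons or Nutch.
final class AgentTokenMatch {

    // Returns true if any space-separated piece of the crawler's configured
    // name equals the value of a robots.txt "User-agent:" line.
    // Assumption: comparison is exact and case-insensitive.
    static boolean matchesAnyPiece(String targetAgentName, String robotsAgentValue) {
        String wanted = robotsAgentValue.trim().toLowerCase(Locale.ROOT);
        if (wanted.isEmpty()) {
            return false;
        }
        String[] pieces = targetAgentName.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (String piece : pieces) {
            if (piece.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A crawler configured as "My Crawler" is split into "my" and "crawler".
        System.out.println(matchesAnyPiece("My Crawler", "crawler"));
        System.out.println(matchesAnyPiece("My Crawler", "otherbot"));
    }
}
```

Under these assumptions, a crawler named "My Crawler" would be governed by a 
rules block addressed to either "my" or "crawler", which is why a name with 
spaces can match more blocks than intended.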
                
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>
>                 Key: NUTCH-1031
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1031
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Tejas Patil
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>
>         Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is 
> available publicly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
