[
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558400#comment-13558400
]
Ken Krugler commented on NUTCH-1031:
------------------------------------
Regarding precedence: my guess is that it's not very important, as I haven't
seen many (any?) robots.txt files where the same robot would match, under
related names, several rule blocks that contain different rules.
This precedence issue is specific to Nutch users, however (it's not part of the
robots.txt RFC), so I'd suggest posting to the Nutch users list to see if anyone
thinks it's important.
As for your review of the CC code: yes, it's correct. There's one additional
wrinkle: the target user agent name is split on spaces, apparently because of
an implicit expectation that you can use a user agent name containing spaces
(which, based on the RFC, isn't actually valid), in which case any piece of the
name will match.
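
To make the split-on-spaces behaviour concrete, here is a minimal sketch. It is
not the actual crawler-commons code: the class and method names are
hypothetical, and the matching rule (case-insensitive equality between one
piece of the target name and the User-agent value from robots.txt) is an
assumption made for illustration only.

```java
import java.util.Locale;

public class AgentNameMatchSketch {

    // Hypothetical helper, NOT the crawler-commons API.
    // Splits the crawler's own (target) agent name on whitespace and reports
    // whether any single piece equals the User-agent value from robots.txt.
    // Assumption: the comparison is case-insensitive equality.
    public static boolean matchesAnyPiece(String targetAgentName, String robotsTxtAgent) {
        String wanted = robotsTxtAgent.trim().toLowerCase(Locale.ROOT);
        String[] pieces = targetAgentName.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (String piece : pieces) {
            if (!piece.isEmpty() && piece.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "My Crawler" is split into "my" and "crawler"; the second piece matches.
        System.out.println(matchesAnyPiece("My Crawler", "crawler"));  // true
        System.out.println(matchesAnyPiece("My Crawler", "otherbot")); // false
    }
}
```

Under these assumptions, a crawler configured with the name "My Crawler" would
pick up a rules block addressed to either "my" or "crawler", which is the
wrinkle described above.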
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
> Issue Type: Task
> Reporter: Julien Nioche
> Assignee: Tejas Patil
> Priority: Minor
> Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons
> [http://code.google.com/p/crawler-commons/] which contains a parser for
> robots.txt files. This parser should also be better than the one we currently
> have in Nutch. I will delegate this functionality to CC as soon as it is
> available publicly
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira