[
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tejas Patil updated NUTCH-1031:
-------------------------------
Attachment: CC.robots.multiple.agents.patch
I looked at the source code of CC to understand how it works and have identified
the change needed in CC so that it supports multiple user agents. While testing
this, I found that there is a semantic difference between the way CC works and
the legacy Nutch parser.
*What CC does:*
It splits _http.robots.agents_ over commas (the change I made locally).
It scans the robots file line by line, each time checking whether the current
"User-Agent" line in the file matches any of the agents from _http.robots.agents_.
If a match is found, it takes all the corresponding rules for that agent and
stops further parsing.
{noformat}robots file
User-Agent: Agent1 #foo
Disallow: /a
User-Agent: Agent2 Agent3
Disallow: /d
------------------------------------
http.robots.agents: "Agent2,Agent1"
------------------------------------
Path: "/a"{noformat}
For the example above, as soon as the first line of the robots file is scanned,
a match for "Agent1" is found. It scans all the corresponding rules for that
agent and stores only this information:
{noformat}User-Agent: Agent1
Disallow: /a{noformat}
Everything else is ignored.
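To make that behaviour concrete, below is a minimal standalone Java sketch (not the actual crawler-commons code; the class and method names are made up for illustration) of the "first match wins" scan: the configured agent names are split over commas, the file is scanned line by line, and the first matching User-Agent block is kept while everything after it is dropped.
{noformat}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the "first match wins" behaviour (illustrative only,
// not the actual crawler-commons implementation).
public class FirstMatchRobotsSketch {

    public static List<String> rulesForAgents(String robotsTxt, String httpRobotsAgents) {
        // Split http.robots.agents over commas (the local change mentioned above).
        Set<String> agents = new HashSet<>();
        for (String a : httpRobotsAgents.split(",")) {
            agents.add(a.trim().toLowerCase());
        }

        List<String> keptRules = new ArrayList<>();
        boolean matched = false;

        // Scan the file line by line; keep the first matching block, then stop.
        for (String line : robotsTxt.split("\n")) {
            line = line.replaceAll("#.*", "").trim();      // drop comments like "#foo"
            if (line.isEmpty()) continue;

            if (line.toLowerCase().startsWith("user-agent:")) {
                if (matched) break;                         // next block reached: stop parsing
                String[] names = line.substring("user-agent:".length()).trim().split("\\s+");
                for (String name : names) {
                    if (agents.contains(name.toLowerCase())) {
                        matched = true;                     // first match wins
                        break;
                    }
                }
            } else if (matched) {
                keptRules.add(line);                        // rules of the matched agent
            }
        }
        return keptRules;
    }

    public static void main(String[] args) {
        String robots = "User-Agent: Agent1 #foo\n"
                + "Disallow: /a\n"
                + "User-Agent: Agent2 Agent3\n"
                + "Disallow: /d\n";
        // Prints [Disallow: /a]: Agent1 matches first, Agent2's block is never considered.
        System.out.println(rulesForAgents(robots, "Agent2,Agent1"));
    }
}
{noformat}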
*What the Nutch robots parser does:*
It splits _http.robots.agents_ over commas, scans ALL the lines of the robots
file, and evaluates the matches according to the precedence of the user agents.
For the example above, the rules for both Agent2 and Agent1 have a match in the
robots file, but since Agent2 comes first in _http.robots.agents_, it is given
priority and the rules stored will be:
{noformat}User-Agent: Agent2
Disallow: /d{noformat}
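For comparison, here is an equally simplified sketch (again illustrative only, not the real Nutch code) of the precedence-based model: all User-Agent blocks are collected first, and the agent listed earliest in _http.robots.agents_ wins even if its block appears later in the file.
{noformat}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the precedence-based behaviour (illustrative only,
// not the actual Nutch implementation).
public class PrecedenceRobotsSketch {

    public static List<String> rulesForAgents(String robotsTxt, String httpRobotsAgents) {
        // First pass: collect the rules of every User-Agent block in the file.
        Map<String, List<String>> rulesByAgent = new HashMap<>();
        List<String> currentAgents = new ArrayList<>();

        for (String line : robotsTxt.split("\n")) {
            line = line.replaceAll("#.*", "").trim();       // drop comments like "#foo"
            if (line.isEmpty()) continue;

            if (line.toLowerCase().startsWith("user-agent:")) {
                currentAgents = new ArrayList<>();
                for (String name : line.substring("user-agent:".length()).trim().split("\\s+")) {
                    currentAgents.add(name.toLowerCase());
                    rulesByAgent.putIfAbsent(name.toLowerCase(), new ArrayList<String>());
                }
            } else {
                for (String agent : currentAgents) {
                    rulesByAgent.get(agent).add(line);
                }
            }
        }

        // Second pass: the agent listed earliest in http.robots.agents wins,
        // regardless of where its block appears in the file.
        for (String a : httpRobotsAgents.split(",")) {
            List<String> rules = rulesByAgent.get(a.trim().toLowerCase());
            if (rules != null && !rules.isEmpty()) {
                return rules;
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        String robots = "User-Agent: Agent1 #foo\n"
                + "Disallow: /a\n"
                + "User-Agent: Agent2 Agent3\n"
                + "Disallow: /d\n";
        // Prints [Disallow: /d]: Agent2 has rules in the file and comes first
        // in http.robots.agents, so it takes precedence over Agent1.
        System.out.println(rulesForAgents(robots, "Agent2,Agent1"));
    }
}
{noformat}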
If we want to drop the precedence-based behaviour and adopt the CC model
instead, I have a small patch for crawler-commons
(CC.robots.multiple.agents.patch).
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
> Issue Type: Task
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Priority: Minor
> Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons
> [http://code.google.com/p/crawler-commons/] which contains a parser for
> robots.txt files. This parser should also be better than the one we currently
> have in Nutch. I will delegate this functionality to CC as soon as it is
> available publicly