[
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tejas Patil updated NUTCH-1031:
-------------------------------
Attachment: CC.robots.multiple.agents.patch
I looked at the source code of CC to understand how it works and have identified
the change needed in CC so that it supports multiple user agents. While testing
this, I found that there is a semantic difference between the way CC works and
the legacy Nutch parser.
*What CC does:*
It splits _http.robots.agents_ over commas (the change I made locally).
It scans the robots file line by line, each time checking whether the current
"User-Agent" line in the file matches any of the agents from _http.robots.agents_.
If a match is found, it takes all the corresponding rules for that agent and
stops further parsing.
{noformat}robots file
User-Agent: Agent1 #foo
Disallow: /a
User-Agent: Agent2 Agent3
Disallow: /d
------------------------------------
http.robots.agents: "Agent2,Agent1"
------------------------------------
Path: "/a"{noformat}
For the example above, as soon as the first line of the robots file is scanned,
a match for "Agent1" is found. It scans all the corresponding rules for that
agent and stores only this information:
{noformat}User-Agent: Agent1
Disallow: /a{noformat}
Everything else is ignored.
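To make that behaviour concrete, below is a minimal standalone Java sketch (not the actual crawler-commons code; the class and method names are made up for illustration) of the "first match wins" scan: the configured agent names are split over commas, the file is scanned line by line, and the first matching User-Agent block is kept while everything after it is dropped.
{noformat}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the "first match wins" behaviour (illustrative only,
// not the actual crawler-commons implementation).
public class FirstMatchRobotsSketch {

    public static List<String> rulesForAgents(String robotsTxt, String httpRobotsAgents) {
        // Split http.robots.agents over commas (the local change mentioned above).
        Set<String> agents = new HashSet<>();
        for (String a : httpRobotsAgents.split(",")) {
            agents.add(a.trim().toLowerCase());
        }

        List<String> keptRules = new ArrayList<>();
        boolean matched = false;

        // Scan the file line by line; keep the first matching block, then stop.
        for (String line : robotsTxt.split("\n")) {
            line = line.replaceAll("#.*", "").trim();      // drop comments like "#foo"
            if (line.isEmpty()) continue;

            if (line.toLowerCase().startsWith("user-agent:")) {
                if (matched) break;                         // next block reached: stop parsing
                String[] names = line.substring("user-agent:".length()).trim().split("\\s+");
                for (String name : names) {
                    if (agents.contains(name.toLowerCase())) {
                        matched = true;                     // first match wins
                        break;
                    }
                }
            } else if (matched) {
                keptRules.add(line);                        // rules of the matched agent
            }
        }
        return keptRules;
    }

    public static void main(String[] args) {
        String robots = "User-Agent: Agent1 #foo\n"
                + "Disallow: /a\n"
                + "User-Agent: Agent2 Agent3\n"
                + "Disallow: /d\n";
        // Prints [Disallow: /a]: Agent1 matches first, Agent2's block is never considered.
        System.out.println(rulesForAgents(robots, "Agent2,Agent1"));
    }
}
{noformat}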
*What the Nutch robots parser does:*
It splits _http.robots.agents_ over commas, scans ALL the lines of the robots
file, and evaluates the matches according to the precedence of the user agents.
For the example above, the rules for both Agent2 and Agent1 have a match in the
robots file, but since Agent2 comes first in _http.robots.agents_, it is given
priority and the rules stored will be:
{noformat}User-Agent: Agent2
Disallow: /d{noformat}
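For comparison, here is an equally simplified sketch (again illustrative only, not the real Nutch code) of the precedence-based model: all User-Agent blocks are collected first, and the agent listed earliest in _http.robots.agents_ wins even if its block appears later in the file.
{noformat}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the precedence-based behaviour (illustrative only,
// not the actual Nutch implementation).
public class PrecedenceRobotsSketch {

    public static List<String> rulesForAgents(String robotsTxt, String httpRobotsAgents) {
        // First pass: collect the rules of every User-Agent block in the file.
        Map<String, List<String>> rulesByAgent = new HashMap<>();
        List<String> currentAgents = new ArrayList<>();

        for (String line : robotsTxt.split("\n")) {
            line = line.replaceAll("#.*", "").trim();       // drop comments like "#foo"
            if (line.isEmpty()) continue;

            if (line.toLowerCase().startsWith("user-agent:")) {
                currentAgents = new ArrayList<>();
                for (String name : line.substring("user-agent:".length()).trim().split("\\s+")) {
                    currentAgents.add(name.toLowerCase());
                    rulesByAgent.putIfAbsent(name.toLowerCase(), new ArrayList<String>());
                }
            } else {
                for (String agent : currentAgents) {
                    rulesByAgent.get(agent).add(line);
                }
            }
        }

        // Second pass: the agent listed earliest in http.robots.agents wins,
        // regardless of where its block appears in the file.
        for (String a : httpRobotsAgents.split(",")) {
            List<String> rules = rulesByAgent.get(a.trim().toLowerCase());
            if (rules != null && !rules.isEmpty()) {
                return rules;
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        String robots = "User-Agent: Agent1 #foo\n"
                + "Disallow: /a\n"
                + "User-Agent: Agent2 Agent3\n"
                + "Disallow: /d\n";
        // Prints [Disallow: /d]: Agent2 has rules in the file and comes first
        // in http.robots.agents, so it takes precedence over Agent1.
        System.out.println(rulesForAgents(robots, "Agent2,Agent1"));
    }
}
{noformat}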
If we want to drop the precedence-based behaviour and adopt the CC model
instead, I have a small patch for crawler-commons
(CC.robots.multiple.agents.patch).
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
> Issue Type: Task
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Priority: Minor
> Labels: robots.txt
> Fix For: 1.7
>
> Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons
> [http://code.google.com/p/crawler-commons/] which contains a parser for
> robots.txt files. This parser should also be better than the one we currently
> have in Nutch. I will delegate this functionality to CC as soon as it is
> available publicly