[ http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12359237 ]

Rod Taylor commented on NUTCH-98:
---------------------------------

According to the Googlebot FAQ, their implementation takes the longest matching 
rule as the one it obeys.

See point 7 of http://www.google.com/webmasters/bot.html, which explains:

Also, there's a small difference between the way Googlebot handles the 
robots.txt file and the way the robots.txt standard says we should (keeping in 
mind the distinction between "should" and "must"). The standard says we should 
obey the first applicable rule, whereas Googlebot obeys the longest (that is, 
the most specific) applicable rule. This more intuitive practice matches what 
people actually do, and what they expect us to do. For example, consider the 
following robots.txt file:

User-Agent: *
Allow: /
Disallow: /cgi-bin 
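
To make the behavior concrete, here is a minimal sketch of longest-match rule
selection in Java. The class and method names are made up for illustration;
this is not the actual Nutch RobotRulesParser, just the idea behind the patch:

import java.util.ArrayList;
import java.util.List;

public class LongestMatchRobotRules {

  // One Allow/Disallow line: a URL-path prefix and whether it permits access.
  static final class Rule {
    final String prefix;
    final boolean allowed;
    Rule(String prefix, boolean allowed) {
      this.prefix = prefix;
      this.allowed = allowed;
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  void addRule(String prefix, boolean allowed) {
    rules.add(new Rule(prefix, allowed));
  }

  // Googlebot-style check: of all rules whose prefix matches the path,
  // obey the one with the longest prefix (the most specific rule wins).
  // Paths matched by no rule are allowed by default.
  boolean isAllowed(String path) {
    Rule best = null;
    for (Rule r : rules) {
      if (path.startsWith(r.prefix)
          && (best == null || r.prefix.length() > best.prefix.length())) {
        best = r;
      }
    }
    return best == null || best.allowed;
  }

  public static void main(String[] args) {
    // The robots.txt example above: Allow: / and Disallow: /cgi-bin
    LongestMatchRobotRules rules = new LongestMatchRobotRules();
    rules.addRule("/", true);
    rules.addRule("/cgi-bin", false);

    System.out.println(rules.isAllowed("/index.html"));     // true
    System.out.println(rules.isAllowed("/cgi-bin/search"));  // false
  }
}

With first-match semantics the Allow: / line would win for every path; with
longest-match, /cgi-bin is the more specific rule and wins for anything under
it. The same selection handles the Disallow: / plus Allow: /rss case from the
issue description below.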

> RobotRulesParser interprets robots.txt incorrectly
> --------------------------------------------------
>
>          Key: NUTCH-98
>          URL: http://issues.apache.org/jira/browse/NUTCH-98
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7
>     Reporter: Jeff Bowden
>     Priority: Minor
>  Attachments: RobotRulesParser.java.diff
>
> Here's a simple example that the current RobotRulesParser gets wrong:
> User-agent: *
> Disallow: /
> Allow: /rss
> The problem is that the isAllowed function takes the first rule that matches 
> and incorrectly decides that URLs starting with "/rss" are Disallowed.  The 
> correct algorithm is to take the *longest* rule that matches.  I will attach 
> a patch that fixes this.
