RobotRulesParser
----------------
Key: NUTCH-101
URL: http://issues.apache.org/jira/browse/NUTCH-101
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.7, 0.8-dev
Reporter: Fuad Efendi
I noticed this code in the protocol-http and protocol-httpclient plugins:
} else if ( (line.length() >= 6)
&& (line.substring(0, 6).equalsIgnoreCase("Allow:")) ) {
However, according to the original 1994 protocol description, there is NO
"Allow:" field. To allow everything, simply use an empty "Disallow:" value.
http://www.robotstxt.org/wc/norobots.html
Please test with www.newegg.com/robots.txt.
Their site has this:
User-agent: *
Disallow:
And Nutch does not work with New Egg, but it should!
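A minimal sketch of the 1994 rule for an empty "Disallow:" value. The class and method names (EmptyDisallowSketch, disallowedPrefixes, isAllowed) are hypothetical and are not taken from the Nutch RobotRulesParser; the sketch only shows the behaviour the protocol description asks for:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the Nutch RobotRulesParser.
final class EmptyDisallowSketch {

    /** Collects the Disallow path prefixes from the lines of a single record. */
    static List<String> disallowedPrefixes(String[] recordLines) {
        List<String> prefixes = new ArrayList<>();
        for (String raw : recordLines) {
            String line = raw.trim();
            // Case-insensitive check for the "Disallow:" field name.
            if (line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String path = line.substring(9).trim();
                // An empty value means "nothing is disallowed": add no prefix.
                if (!path.isEmpty()) {
                    prefixes.add(path);
                }
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(List<String> prefixes, String urlPath) {
        for (String prefix : prefixes) {
            if (urlPath.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

With the New Egg record above, disallowedPrefixes returns an empty list, so isAllowed is true for every path; with "Disallow: /" every path is rejected.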
Sorry guys, I don't have enough time to double-check this; could you please
verify all of it?
I noticed a strange discussion at nutch-agent:lucene.apache.org; it seems that
we need to test ......./robots.txt, which contains:
User-agent: ia_archiver
Disallow: /
User-agent: Googlebot-Image
Disallow: /
User-agent: Nutch
Disallow: /
User-agent: TurnitinBot
Disallow: /
All of this follows the standard protocol. Could you please retest whether it
works with multi-line files (several User-agent records)? It's a standard!
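To make the retest concrete, here is a rough sketch of record-by-record parsing. All names in it (MultiRecordSketch, parse, isAllowed) are hypothetical and do not come from the Nutch source. It assumes that a new record starts at a blank line or at a User-agent line that follows a Disallow line, and that agent names match as case-insensitive substrings, which the 1994 description recommends:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; this is NOT the Nutch RobotRulesParser.
final class MultiRecordSketch {

    /** Maps each lower-cased User-agent value to its Disallow prefixes. */
    static Map<String, List<String>> parse(String robotsTxt) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        List<String> agents = new ArrayList<>();
        boolean sawRule = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            if (raw.trim().isEmpty()) {            // a blank line ends the record
                agents = new ArrayList<>();
                sawRule = false;
                continue;
            }
            int hash = raw.indexOf('#');           // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                          // comment-only or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (sawRule) {                     // User-agent after a rule: new record
                    agents = new ArrayList<>();
                    sawRule = false;
                }
                if (value.isEmpty()) {
                    continue;
                }
                String agent = value.toLowerCase(Locale.ROOT);
                agents.add(agent);
                rules.computeIfAbsent(agent, k -> new ArrayList<>());
            } else if (field.equals("disallow")) {
                sawRule = true;
                if (!value.isEmpty()) {            // empty value disallows nothing
                    for (String agent : agents) {
                        rules.get(agent).add(value);
                    }
                }
            }
        }
        return rules;
    }

    /** Uses the first record whose agent name is a substring of ours, else "*". */
    static boolean isAllowed(Map<String, List<String>> rules, String agentName, String urlPath) {
        String name = agentName.toLowerCase(Locale.ROOT);
        List<String> prefixes = null;
        for (Map.Entry<String, List<String>> e : rules.entrySet()) {
            if (!e.getKey().equals("*") && name.contains(e.getKey())) {
                prefixes = e.getValue();
                break;
            }
        }
        if (prefixes == null) {
            prefixes = rules.get("*");
        }
        if (prefixes == null) {
            return true;                           // no matching record: allowed
        }
        for (String prefix : prefixes) {
            if (urlPath.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Fed the four records above, this sketch keeps four separate entries and reports that the agent "Nutch" may not fetch any path, while an agent that matches none of the records is allowed.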
I see this in the code:
StringTokenizer tok = new StringTokenizer(agentNames, ",");
Comma-separated? That is not an accepted standard yet...
Sorry, WebExpertsAmerica, I really didn't have time to run any tests...
Please do not execute tests against production sites.
Thanks!
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira