Switch to crawler-commons version of robots.txt parsing code
------------------------------------------------------------
Key: NUTCH-1008
URL: https://issues.apache.org/jira/browse/NUTCH-1008
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Ken Krugler
Priority: Minor

The Bixo project has an improved version of Nutch's robots.txt parsing code.
This was recently contributed to crawler-commons, in a form that should be
independent of Bixo, Cascading, and even Hadoop.

Nutch could switch to this code and benefit from more robust parsing, better
compliance with ad hoc extensions to the Robots Exclusion Protocol, and a wider
community of users and developers maintaining it.