Switch to crawler-commons version of robots.txt parsing code
------------------------------------------------------------

                 Key: NUTCH-1008
                 URL: https://issues.apache.org/jira/browse/NUTCH-1008
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.4
            Reporter: Ken Krugler
            Priority: Minor


The Bixo project has an improved version of Nutch's robots.txt parsing code.

That code was recently contributed to crawler-commons, in a form that should be 
independent of Bixo, Cascading, and even Hadoop.

Nutch could switch to the crawler-commons parser and benefit from more robust 
parsing, better compliance with ad hoc extensions to the Robots Exclusion 
Protocol, and a wider community of users and developers for that code.
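
To make "ad hoc extensions" concrete: Crawl-delay is a widely used directive 
that is not part of the original robots exclusion convention, and real-world 
robots.txt files often contain comments, blank lines, and malformed entries. 
The sketch below is only an illustration of what tolerant parsing might look 
like. It is not the crawler-commons code, nor Nutch's existing parser; the 
class and method names (RobotsSketch, Rules, parse) are hypothetical, and 
matching is simplified to plain path prefixes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the crawler-commons or Nutch API.
public class RobotsSketch {

    /** Rules for one crawler: disallowed path prefixes and an optional crawl delay. */
    public static final class Rules {
        final List<String> disallowed = new ArrayList<>();
        long crawlDelayMillis = -1; // -1 means no Crawl-delay was given

        public boolean isAllowed(String path) {
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) {
                    return false;
                }
            }
            return true;
        }

        public long getCrawlDelayMillis() {
            return crawlDelayMillis;
        }
    }

    /** Parses robots.txt content, preferring a group that names the agent over "*". */
    public static Rules parse(String content, String agentName) {
        Rules specific = new Rules();
        Rules wildcard = new Rules();
        boolean sawSpecific = false;
        Rules current = null;        // group being filled; null if it targets other agents
        boolean inAgentList = false; // true while reading consecutive User-agent lines
        String agent = agentName.toLowerCase(Locale.ROOT);

        for (String raw : content.split("\\r?\\n|\\r")) {
            // Drop trailing comments, then skip blank or malformed lines.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                if (!inAgentList) {
                    current = null; // a new group starts here
                    inAgentList = true;
                }
                String v = value.toLowerCase(Locale.ROOT);
                if (v.equals("*")) {
                    if (current == null) {
                        current = wildcard;
                    }
                } else if (!v.isEmpty() && agent.contains(v)) {
                    current = specific;
                    sawSpecific = true;
                }
                continue;
            }

            inAgentList = false;
            if (current == null) {
                continue; // directive belongs to a group for some other agent
            }
            if (field.equals("disallow")) {
                if (!value.isEmpty()) { // an empty Disallow means "allow everything"
                    current.disallowed.add(value);
                }
            } else if (field.equals("crawl-delay")) {
                try {
                    // Value is in seconds and may be fractional.
                    current.crawlDelayMillis = Math.round(Double.parseDouble(value) * 1000);
                } catch (NumberFormatException e) {
                    // Ignore an unparseable delay instead of rejecting the whole file.
                }
            }
            // Unknown fields (for example Sitemap) are ignored in this sketch.
        }
        return sawSpecific ? specific : wildcard;
    }
}
```

A caller would pass the fetched robots.txt text and its own agent name to 
parse(), then consult isAllowed() and getCrawlDelayMillis() before fetching. 
A shared library would be expected to cover many more cases than this, which 
is the motivation for reusing one rather than maintaining a private parser.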

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
