Łukasz Langa added the comment:

robotparser implements http://www.robotstxt.org/orig.html; there's even a link to that document at http://docs.python.org/3/library/urllib.robotparser.html.

As mher points out, there's a newer version of the spec, written up as an RFC draft: http://www.robotstxt.org/norobots-rfc.txt. It introduces the Allow directive, specifies how percent-encoding should be treated, and describes how expiration should be handled.
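To make that concrete, here is a minimal sketch of the evaluation model the draft describes: rules are checked in file order, the first Allow/Disallow line whose path is a prefix of the requested path wins, and paths are compared percent-decoded (except %2F, an encoded "/", which must stay escaped). This is an illustration written for this comment, not the module's code; rfc_can_fetch and its (allow, prefix) rule format are invented for the example:

from urllib.parse import unquote

def rfc_can_fetch(rules, path):
    """rules: list of (allow: bool, path_prefix: str) in file order."""
    # Decode %-escapes before comparing; keep %2F intact so an encoded
    # slash isn't confused with a path separator (simplified: only the
    # uppercase %2F spelling is handled here).
    path = unquote(path.replace("%2F", "%252F"))
    for allow, prefix in rules:
        prefix = unquote(prefix.replace("%2F", "%252F"))
        if path.startswith(prefix):
            return allow  # first matching rule wins
    return True  # no rule matched: access is allowed

rules = [(True, "/folder/page.html"), (False, "/folder/")]
print(rfc_can_fetch(rules, "/folder/page.html"))   # True: Allow listed first
print(rfc_can_fetch(rules, "/folder/other.html"))  # False

rules2 = [(False, "/%7Ejoe/")]
print(rfc_can_fetch(rules2, "/~joe/index.html"))   # False: %7E decodes to "~"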
Moreover, there is a de facto standard agreed on by Google, Yahoo and Microsoft in 2008, documented in their respective blog posts:

http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html
http://www.ysearchblog.com/2008/06/03/one-standard-fits-all-robots-exclusion-protocol-for-yahoo-google-and-microsoft/
http://www.bing.com/blogs/site_blogs/b/webmaster/archive/2008/06/03/robots-exclusion-protocol-joining-together-to-provide-better-documentation.aspx

For reference, there are two third-party robots.txt parsers out there implementing these extensions:

- https://pypi.python.org/pypi/reppy
- https://pypi.python.org/pypi/robotexclusionrulesparser

We need to decide how to incorporate the new features while addressing backwards compatibility concerns. A rough sketch of the de facto matching rules follows below.

----------
assignee:  -> lukasz.langa

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17403>
_______________________________________
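As promised above, here is a sketch of the 2008 de facto extensions: "*" matches any run of characters, "$" anchors the end of the URL, and among all matching rules the one with the longest pattern wins, with Allow beating Disallow on ties. This is written for illustration only; it is not how reppy or robotexclusionrulesparser actually implement it, and defacto_can_fetch is an invented name:

import re

def _pattern_to_regex(pattern):
    # Escape regex metacharacters, then restore the two robots.txt
    # wildcards: "*" -> ".*" and a trailing "$" -> end-of-string anchor.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

def defacto_can_fetch(rules, path):
    """rules: list of (allow: bool, pattern: str); file order is irrelevant."""
    best = None  # (pattern_length, allow); longest pattern wins, Allow on ties
    for allow, pattern in rules:
        if _pattern_to_regex(pattern).match(path):
            key = (len(pattern), allow)
            if best is None or key > best:
                best = key
    return True if best is None else best[1]

rules = [(False, "/folder/"), (True, "/folder/*.html$")]
print(defacto_can_fetch(rules, "/folder/page.html"))  # True: Allow rule is longer
print(defacto_can_fetch(rules, "/folder/data.json"))  # False

Note the contrast with the RFC draft: precedence here comes from pattern length rather than file order, which is why the same pair of rules can give different answers under the two models.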