karl added the comment:

→ python
Python 2.7.5 (default, Mar 9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> rp = robotparser.RobotFileParser('http://somesite.test.site/robots.txt')
>>> rp.read()
>>>
Let's check the server logs:

127.0.0.1 - - [23/Jun/2014:08:44:37 +0900] "GET /robots.txt HTTP/1.0" 200 92 "-" "Python-urllib/1.17"

In 2.*, robotparser uses the Python-urllib/1.17 user agent by default, which is traditionally blocked by many sysadmins. A solution has already been proposed above. This is the proposed test for 3.4:

import urllib.robotparser
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'MyUa/0.1')]
urllib.request.install_opener(opener)
rp = urllib.robotparser.RobotFileParser('http://localhost:9999')
rp.read()

The issue is no longer about changing the lib, but just about documenting how to change the RobotFileParser default UA. We can change the title of this issue if it's confusing, or close it and open a new one for documenting what makes it easier. :)

Currently robotparser.py imports the urllib user agent:
http://hg.python.org/cpython/file/7dc94337ef67/Lib/urllib/request.py#l364

It's a common failure we encounter when using urllib in general, including robotparser. As for Wikipedia, they fixed their server-side user agent sniffing and no longer filter python-urllib:

GET /robots.txt HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate, compress
Host: en.wikipedia.org
User-Agent: Python-urllib/1.17

HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 3161
Cache-control: s-maxage=3600, must-revalidate, max-age=0
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 5208
Content-Type: text/plain; charset=utf-8
Date: Sun, 22 Jun 2014 23:59:16 GMT
Last-modified: Tue, 26 Nov 2013 17:39:43 GMT
Server: Apache
Set-Cookie: GeoIP=JP:Tokyo:35.6850:139.7514:v4; Path=/; Domain=.wikipedia.org
Vary: X-Subdomain
Via: 1.1 varnish, 1.1 varnish, 1.1 varnish
X-Article-ID: 19292575
X-Cache: cp1065 miss (0), cp4016 hit (1), cp4009 frontend hit (215)
X-Content-Type-Options: nosniff
X-Language: en
X-Site: wikipedia
X-Varnish: 2529666795, 2948866481 2948865637, 4134826198 4130750894

Many other sites still do. :)

----------
versions: +Python 3.4 -Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue15851>
_______________________________________
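(Not part of the original proposal: if the documentation ends up recommending a way to set the UA without installing a global opener, a minimal sketch for 3.4 could subclass RobotFileParser and fetch robots.txt with an explicit User-Agent header. The class name, the 'MyUa/0.1' string, and the localhost URL below are illustrative assumptions only.)

import urllib.error
import urllib.request
import urllib.robotparser

class UserAgentRobotFileParser(urllib.robotparser.RobotFileParser):
    """RobotFileParser that fetches robots.txt with a custom User-Agent.

    Illustrative sketch; not a stdlib class.
    """

    def __init__(self, url='', user_agent='MyUa/0.1'):
        super().__init__(url)
        self.user_agent = user_agent

    def read(self):
        # Same flow as the stdlib read(), but the request carries an
        # explicit User-Agent header instead of Python-urllib/x.y.
        request = urllib.request.Request(
            self.url, headers={'User-Agent': self.user_agent})
        try:
            f = urllib.request.urlopen(request)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400:
                self.allow_all = True
        else:
            self.parse(f.read().decode('utf-8').splitlines())

rp = UserAgentRobotFileParser('http://localhost:9999/robots.txt')
rp.read()
print(rp.can_fetch('MyUa/0.1', 'http://localhost:9999/'))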