In article <[EMAIL PROTECTED]>,
[EMAIL PROTECTED] wrote:
> Hi,
> I'm trying to get the Wikipedia page source with urllib2:
>     usock = urllib2.urlopen("http://en.wikipedia.org/wiki/Albert_Einstein")
>     data = usock.read()
>     usock.close()
>     return data
> I got an exception because of an HTTP 403 error. Why? With my browser I
> can access it without any problem.
>
> Thanks,
> Shahar.
It appears that Wikipedia may inspect the contents of the User-Agent
HTTP header, and that it does not particularly like the string it
receives from Python's urllib. I was able to make it work with urllib
via the following code:
import urllib

class CustomURLopener(urllib.FancyURLopener):
    version = 'Mozilla/5.0'

urllib._urlopener = CustomURLopener()

u = urllib.urlopen('http://en.wikipedia.org/wiki/Albert_Einstein')
data = u.read()
I'm assuming a similar trick could be used with urllib2, though I didn't
actually try it. Another thing to watch out for is that some sites
will redirect a public URL X to an internal URL Y, and will check that
access to Y is only permitted if the Referer field indicates coming from
somewhere internal to the site. I have seen both of these techniques
used to foil screen-scraping.
Cheers,
-M
-- 
Michael J. Fromberger             | Lecturer, Dept. of Computer Science
http://www.dartmouth.edu/~sting/  | Dartmouth College, Hanover, NH, USA
--
http://mail.python.org/mailman/listinfo/python-list