On Tue, 05 Aug 2008 06:59:20 -0300, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
> On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:
>
>> On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
>> <[EMAIL PROTECTED]> wrote:
>>
>>> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
>>> snag with URLs containing ampersands:
>>>
>>> http://www.example.com/parrot.php?x=1&y=2
>>>
>>> Somewhere in the process, URLs like the above are escaped to:
>>>
>>> http://www.example.com/parrot.php?x=1&amp;y=2
>>>
>>> which naturally fails to exist.
>>>
>>> I could just do a string replace, but is there a "right" way to escape
>>> and unescape URLs? I've looked through the standard lib, but I can't
>>> find anything helpful.
>>
>> This works fine for me:
>>
>> py> import urllib
>> py> fn = urllib.urlretrieve(
>> ...     "http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
>> py> open(fn, "rb").read()
>> '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
>>
>> So it's not urlretrieve escaping the url, but something else in your
>> code...
>
> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from an HTML file.

(Ok, you didn't even mention you were scraping HTML pages...)

> I have searched for, but been unable to find, standard library
> functions that escape or unescape URLs. Are there any such functions?

Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.

How are you scraping the HTML source? Both BeautifulSoup and
ElementTree.HTMLTreeBuilder already do that work for you.

--
Gabriel Genellina
--
http://mail.python.org/mailman/listinfo/python-list
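[Editor's note: a minimal sketch of the unescaping route mentioned above, using xml.sax.saxutils.unescape on an entity-escaped URL like the one in the thread. The URL itself is the thread's example, not a real endpoint.]

```python
from xml.sax.saxutils import unescape

# URL as scraped from HTML source: "&" appears as the entity "&amp;"
scraped = "http://www.example.com/parrot.php?x=1&amp;y=2"

# unescape() converts the predefined XML entities (&amp;, &lt;, &gt;)
# back into literal characters, yielding the URL that actually exists
url = unescape(scraped)
print(url)  # http://www.example.com/parrot.php?x=1&y=2
```

A proper HTML parser (BeautifulSoup, ElementTree.HTMLTreeBuilder) does this entity decoding for you when it extracts the href attributes, which is why scraping with one of those avoids the problem entirely.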