On Tue, 05 Aug 2008 06:59:20 -0300, Steven D'Aprano <[EMAIL PROTECTED]> wrote:
> On Mon, 04 Aug 2008 23:16:46 -0300, Gabriel Genellina wrote:
>
>> On Mon, 04 Aug 2008 20:43:45 -0300, Steven D'Aprano
>> <[EMAIL PROTECTED]> wrote:
>>
>>> I'm using urllib.urlretrieve() to download HTML pages, and I've hit a
>>> snag with URLs containing ampersands:
>>>
>>> http://www.example.com/parrot.php?x=1&y=2
>>>
>>> Somewhere in the process, URLs like the above are escaped to:
>>>
>>> http://www.example.com/parrot.php?x=1&amp;y=2
>>>
>>> which naturally fails to exist.
>>>
>>> I could just do a string replace, but is there a "right" way to escape
>>> and unescape URLs? I've looked through the standard lib, but I can't
>>> find anything helpful.
>>
>> This works fine for me:
>>
>> py> import urllib
>> py> fn = urllib.urlretrieve(
>> ...     "http://c7.amazingcounters.com/counter.php?i=1516903&c=4551022")[0]
>> py> open(fn, "rb").read()
>> '\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00...
>>
>> So it's not urlretrieve escaping the url, but something else in your
>> code...
>
> I didn't say urlretrieve was escaping the URL. I actually think the
> URLs are pre-escaped when I scrape them from an HTML file.

(Ok, you didn't even mention you were scraping HTML pages...)

> I have searched for, but been unable to find, standard library
> functions that escape or unescape URLs. Are there any such functions?

Yes: cgi.escape/unescape, and xml.sax.saxutils.escape/unescape.

How are you scraping the HTML source? Both BeautifulSoup and
ElementTree.HTMLTreeBuilder already do that work for you.

--
Gabriel Genellina
--
http://mail.python.org/mailman/listinfo/python-list
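[Editor's note: a minimal sketch of the unescaping route mentioned above, using xml.sax.saxutils.unescape on an entity-escaped URL like the one in the thread. The URL itself is the thread's example, not a real endpoint.]

```python
from xml.sax.saxutils import unescape

# URL as scraped from HTML source: "&" appears as the entity "&amp;"
scraped = "http://www.example.com/parrot.php?x=1&amp;y=2"

# unescape() converts the predefined XML entities (&amp;, &lt;, &gt;)
# back into literal characters, yielding the URL that actually exists
url = unescape(scraped)
print(url)  # http://www.example.com/parrot.php?x=1&y=2
```

A proper HTML parser (BeautifulSoup, ElementTree.HTMLTreeBuilder) does this entity decoding for you when it extracts the href attributes, which is why scraping with one of those avoids the problem entirely.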