Unescaping URLs in Python

John Nagle Sun, 24 Dec 2006 19:56:01 -0800

Here's a URL from a link on the home page of a major company.

        <a href="/adsk/servlet/index?siteID=123112&amp;id=1860142">About Us</a>


Yes, that "&amp;" is in the source text of the page.

This is, in fact, correct HTML. See

        http://www.htmlhelp.com/tools/validator/problems.html#amp

     What's the appropriate Python function to call to unescape a URL which
might contain things like that?  Will this interfere with the usual "%" type
escapes in URLs?

     What's actually needed to get this right is something that goes from
HTML escaped form to URL escaped form, because, in general, there is no
unescaped form that will work for all URLs.

There's "htmldecode" at "http://zesty.ca/python/scrape.py", which works,
but this should be a standard library function.
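
For illustration only, here is a minimal sketch of the HTML-escaped to
URL-escaped conversion described above. It assumes a much later Python
(3.4 or newer), where html.unescape and urllib.parse.quote are in the
standard library. The function name html_attr_to_url and the choice of
"safe" characters are assumptions made for this example, not an
established API. Because "%" is listed as safe, existing "%xx" escapes
pass through unchanged, so the HTML unescaping step should not interfere
with them.

```python
from html import unescape
from urllib.parse import quote

# Hypothetical choice: URL-reserved characters plus "%" are left alone,
# so existing "%xx" escapes are not encoded a second time.
SAFE_CHARS = "%/:=&?~#+!$,;'@()*[]"


def html_attr_to_url(href):
    """Convert an HTML-escaped attribute value to a URL-escaped string."""
    # Step 1: undo HTML escaping only ("&amp;" becomes "&").
    url = unescape(href)
    # Step 2: percent-encode anything else that is not legal in a URL,
    # such as spaces and non-ASCII characters (encoded as UTF-8).
    return quote(url, safe=SAFE_CHARS)


print(html_attr_to_url("/adsk/servlet/index?siteID=123112&amp;id=1860142"))
# prints /adsk/servlet/index?siteID=123112&id=1860142
```

The result stays in URL-escaped form throughout, so it never has to pass
through a fully unescaped representation.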
                                
                                John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
