Re: BeautifulSoup vs. loose & chars

Duncan Booth Tue, 26 Dec 2006 09:11:06 -0800

"Felipe Almeida Lessa" <[EMAIL PROTECTED]> wrote:

> On 26 Dec 2006 04:22:38 -0800, placid <[EMAIL PROTECTED]> wrote:
>> So do you want to remove "&" or replace them with "&amp;" ? If you
>> want to replace it try the following;
> 
> I think he wants to replace them, but just the invalid ones. I.e.,
> 
> This & this &amp; that
> 
> would become
> 
> This &amp; this &amp; that
> 
> 
> No, i don't know how to do this efficiently. =/...
> I think some kind of regex could do it.
>


Since he's asking for valid XML as output, it isn't sufficient just to
leave existing entity references alone: HTML has a lot of named entities
such as &nbsp; but XML predefines only five (&amp; &lt; &gt; &quot; and
&apos;). The safest technique is to convert everything to numeric escapes
except for the few that are also guaranteed to be available in XML.
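
To make the difference concrete, here is a minimal sketch (Python 3,
standard library only; the helper name parses_as_xml is made up for
illustration) that feeds three small fragments to a plain XML parser. With
no DTD to declare them, the HTML-only names should be rejected as
undefined, while the predefined entities and the numeric escapes go
through.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment):
    """Return True if a plain XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# The five predefined XML entities need no DTD.
print(parses_as_xml("<p>&amp; &lt; &gt; &quot; &apos;</p>"))  # True

# HTML-only named entities are undefined without a DTD.
print(parses_as_xml("<p>that&nbsp;&eacute;</p>"))  # False

# The same two characters written as numeric escapes are accepted.
print(parses_as_xml("<p>that&#160;&#233;</p>"))  # True
```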

Try this:

from cgi import escape
import re
from htmlentitydefs import name2codepoint
name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

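# Groups: 1 = decimal reference, 2 = hex reference, 3 = entity name.
# Names may contain digits (e.g. &frac12; or &sup2;).
EntityPattern = re.compile(
    r'&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')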

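def decodeEntities(s, encoding='utf-8'):
    # Decode byte strings first so they mix safely with unichr() results.
    if isinstance(s, str):
        s = s.decode(encoding)
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        code = match.group(2)
        if code:
            return unichr(int(code, 16))
        name = match.group(3)
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        # Unknown name: leave it alone; escape() will turn its & into &amp;
        return match.group(0)
    return EntityPattern.sub(unescape, s)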

>>> escape(
...     decodeEntities("This & this &amp; that&nbsp;&eacute;")).encode(
...         'ascii', 'xmlcharrefreplace')
'This &amp; this &amp; that&#160;&#233;'
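
For anyone on Python 3, where unichr and the htmlentitydefs module no
longer exist and cgi.escape has been removed, a rough equivalent might
look like the sketch below. The names NAMES, ENTITY, decode_entities and
to_xml_safe are mine, not from any library, and I've assumed that unknown
or out-of-range references should simply be left as text so that escape()
can neutralise their ampersand.

```python
import re
from html import escape
from html.entities import name2codepoint

# Copy so the stdlib table is not mutated, then add the XML-only 'apos'.
NAMES = dict(name2codepoint)
NAMES["apos"] = ord("'")

# Groups: 1 = decimal reference, 2 = hex reference, 3 = entity name.
ENTITY = re.compile(r"&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));")


def decode_entities(s):
    """Replace every recognised entity reference with its character."""
    def unescape(match):
        dec, hexa, name = match.groups()
        try:
            if dec:
                return chr(int(dec, 10))
            if hexa:
                return chr(int(hexa, 16))
            return chr(NAMES[name])
        except (KeyError, ValueError, OverflowError):
            # Assumption: unknown names and impossible code points
            # are kept verbatim rather than raising.
            return match.group(0)
    return ENTITY.sub(unescape, s)


def to_xml_safe(s):
    """Decode entities, re-escape & < >, and make the result ASCII-only."""
    escaped = escape(decode_entities(s), quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_xml_safe("This & this &amp; that&nbsp;&eacute;"))
# This &amp; this &amp; that&#160;&#233;
```

html.escape is called with quote=False because it escapes quotes by
default, whereas cgi.escape did not.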


P.S. apos is handled specially as it isn't technically a
valid HTML entity (and Python doesn't include it in its entity
list), but it is an XML entity and recognised by many browsers, so some
people might use it in HTML.
 
-- 
http://mail.python.org/mailman/listinfo/python-list
