Duncan Booth wrote:
> "Felipe Almeida Lessa" <[EMAIL PROTECTED]> wrote:
>
>> On 26 Dec 2006 04:22:38 -0800, placid <[EMAIL PROTECTED]> wrote:
>>> So do you want to remove "&" or replace them with "&amp;"? If you
>>> want to replace it try the following;
>>
>> I think he wants to replace them, but just the invalid ones. I.e.,
>>
>>     This & this &amp; that
>>
>> would become
>>
>>     This &amp; this &amp; that
>>
>> No, I don't know how to do this efficiently. =/...
>> I think some kind of regex could do it.
>
> Since he's asking for valid xml as output, it isn't sufficient just to
> ignore entity definitions: HTML has a lot of named entities such as
> &nbsp; but xml only has a very limited set of predefined named entities.
> The safest technique is to convert them all to numeric escapes except
> for the very limited set also guaranteed to be available in xml.
>
> Try this:
>
> from cgi import escape
> import re
> from htmlentitydefs import name2codepoint
> name2codepoint = name2codepoint.copy()
> name2codepoint['apos']=ord("'")
>
> EntityPattern = re.compile('&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')
>
> def decodeEntities(s, encoding='utf-8'):
>     def unescape(match):
>         code = match.group(1)
>         if code:
>             return unichr(int(code, 10))
>         else:
>             code = match.group(2)
>             if code:
>                 return unichr(int(code, 16))
>             else:
>                 return unichr(name2codepoint[match.group(3)])
>     return EntityPattern.sub(unescape, s)
>
> >>> escape(
> ...     decodeEntities("This & this &amp; that &eacute;")).encode(
> ...     'ascii', 'xmlcharrefreplace')
> 'This &amp; this &amp; that &#233;'
>
> P.S. apos is handled specially as it isn't technically a
> valid html entity (and Python doesn't include it in its entity
> list), but it is an xml entity and recognised by many browsers so some
> people might use it in html.

Hey, I found this site:
http://www.htmlhelp.com/reference/html40/entities/symbols.html
I hope that's what you mean.

/Scripter47
--
http://mail.python.org/mailman/listinfo/python-list
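
For anyone reading this thread on Python 3: the quoted snippet is Python 2 only (cgi.escape was removed in Python 3.8, htmlentitydefs became html.entities, and unichr became chr). Below is a rough sketch of the same decode-then-escape idea ported to Python 3. The names decode_entities and to_xml_safe are made up for this example and are not from the quoted post, and two behaviours differ from it by assumption: entity names may contain digits (e.g. frac12), and unknown names are left untouched instead of raising KeyError.

```python
import re
from html import escape
from html.entities import name2codepoint

# Copy the stdlib table so it is not mutated, then add the XML-only "apos".
ENTITIES = dict(name2codepoint)
ENTITIES["apos"] = ord("'")

# Group 1: decimal reference, group 2: hex reference, group 3: named entity.
ENTITY_PATTERN = re.compile(
    r"&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));"
)


def decode_entities(text):
    """Replace recognised character/entity references with the character."""
    def unescape(match):
        dec, hexa, name = match.groups()
        if dec:
            return chr(int(dec, 10))
        if hexa:
            return chr(int(hexa, 16))
        if name in ENTITIES:
            return chr(ENTITIES[name])
        # Assumption: an unknown name is kept as-is, so its "&" gets
        # escaped later like any other stray ampersand.
        return match.group(0)
    return ENTITY_PATTERN.sub(unescape, text)


def to_xml_safe(text):
    """Decode everything first, then escape &, <, > and use numeric
    references for all non-ASCII characters."""
    decoded = decode_entities(text)
    escaped = escape(decoded, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_xml_safe("This & this &amp; that &eacute;"))
```

Decoding before escaping is what keeps an already valid "&amp;" from being escaped twice. Note that an out-of-range numeric reference would still make chr() raise ValueError; this sketch does not guard against that.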