"Felipe Almeida Lessa" <[EMAIL PROTECTED]> wrote: > On 26 Dec 2006 04:22:38 -0800, placid <[EMAIL PROTECTED]> wrote: >> So do you want to remove "&" or replace them with "&" ? If you >> want to replace it try the following; > > I think he wants to replace them, but just the invalid ones. I.e., > > This & this & that > > would become > > This & this & that > > > No, i don't know how to do this efficiently. =/... > I think some kind of regex could do it. >
Since he's asking for valid xml as output, it isn't sufficient just to
ignore entity definitions: HTML has a lot of named entities such as
&nbsp; but xml only has a very limited set of predefined named entities.
The safest technique is to convert them all to numeric escapes except
for the very limited set also guaranteed to be available in xml.

Try this:

# Python 2: htmlentitydefs and unichr (html.entities and chr in Python 3)
from cgi import escape
import re
from htmlentitydefs import name2codepoint

# Work on a copy so the module's own table is not modified.
name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: entity name
EntityPattern = re.compile('&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                return unichr(name2codepoint[match.group(3)])
    return EntityPattern.sub(unescape, s)

>>> escape(
    decodeEntities("This &amp; this & that &eacute;")).encode(
    'ascii', 'xmlcharrefreplace')
'This &amp; this &amp; that &#233;'

P.S. apos is handled specially as it isn't technically a valid html
entity (and Python doesn't include it in its entity list), but it is an
xml entity and recognised by many browsers so some people might use it
in html.
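
For anyone reading this on Python 3, the same decode-then-escape idea can be sketched roughly as below. This is only a sketch under a few assumptions, not the code from the post: the names ENTITIES, ENTITY_PATTERN, decode_entities and to_xml_safe are made up for illustration, html.escape(..., quote=False) stands in for cgi.escape, the pattern is widened to allow digits in entity names (so that names like frac12 match), and unknown entity names are left untouched instead of raising KeyError.

```python
import re
from html import escape
from html.entities import name2codepoint

# Copy the stdlib table so it is not mutated, then add the XML-only 'apos'.
ENTITIES = dict(name2codepoint)
ENTITIES["apos"] = ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: entity name
# (assumption: names may contain digits after the first letter, e.g. frac12)
ENTITY_PATTERN = re.compile(
    r"&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));"
)


def decode_entities(text):
    """Replace every recognised entity reference with the character it names."""
    def unescape(match):
        dec, hexa, name = match.groups()
        if dec:
            return chr(int(dec, 10))
        if hexa:
            return chr(int(hexa, 16))
        if name in ENTITIES:
            return chr(ENTITIES[name])
        # Unknown name: keep the text as-is; its '&' gets escaped later.
        return match.group(0)
    return ENTITY_PATTERN.sub(unescape, text)


def to_xml_safe(text):
    """Decode all entities, re-escape &, < and >, and emit ASCII with
    numeric character references for everything else."""
    decoded = decode_entities(text)
    escaped = escape(decoded, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_xml_safe("This &amp; this & that &eacute;"))
```

Decoding everything first and escaping afterwards means a bare "&" and an existing "&amp;" both end up as "&amp;", so nothing is escaped twice. Numeric references outside the Unicode range would still make chr() raise ValueError; this sketch does not handle that case.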