I'm having a horrible time trying to get xml.dom.pulldom to consume a UTF-8 encoded XML file. Here's what I've tried so far:

>>> xml_utf8 = """<?xml version="1.0" encoding="UTF-8" ?>
<msg>Simon\xe2\x80\x99s XML nightmare</msg>
"""
>>> from xml.dom import pulldom
>>> parser = pulldom.parseString(xml_utf8)
>>> parser.next()
('START_DOCUMENT', <xml.dom.minidom.Document instance at 0x6f06c0>)
>>> parser.next()
('START_ELEMENT', <DOM Element: msg at 0x6f0710>)
>>> parser.next()
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 21: ordinal not in range(128)

xml.dom.minidom can handle the string just fine:

>>> from xml.dom import minidom
>>> dom = minidom.parseString(xml_utf8)
>>> dom.toxml()
u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'

If I pass a unicode string to pulldom instead of a UTF-8 encoded bytestring, it still breaks:

>>> xml_unicode = u'<?xml version="1.0" ?><msg>Simon\u2019s XML nightmare</msg>'
>>> parser = pulldom.parseString(xml_unicode)
...
/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/dom/pulldom.py in parseString(string, parser)
    346
    347     bufsize = len(string)
--> 348     buf = StringIO(string)
    349     if not parser:
    350         parser = xml.sax.make_parser()

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 32: ordinal not in range(128)

Is it possible to consume UTF-8 or unicode using xml.dom.pulldom, or should I try something else?

Thanks,

Simon Willison
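
For comparison, here is a minimal sketch of one possible workaround. It assumes a Python 3 interpreter rather than the Python 2.5 used in the session above, and the helper name collect_text is hypothetical. The idea is to hand pulldom.parse() a byte stream so that the XML parser does the UTF-8 decoding itself, and to read the text from node.data instead of echoing the node at the prompt. In the first session, position 21 is consistent with the character falling inside the text node's repr (<DOM Text node "Simon...), so the failure there may come from displaying the node rather than from parsing it; that is a guess, not something the truncated traceback confirms.

```python
import io
from xml.dom import pulldom

# Same payload as xml_utf8 in the post, written as UTF-8 encoded bytes.
xml_utf8 = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b'<msg>Simon\xe2\x80\x99s XML nightmare</msg>\n'
)


def collect_text(xml_bytes):
    """Return all character data in the document as one unicode string.

    collect_text is a hypothetical helper for illustration only.
    """
    # pulldom.parse() accepts a file-like object; wrapping the bytes in
    # BytesIO lets the parser apply the encoding from the XML declaration.
    events = pulldom.parse(io.BytesIO(xml_bytes))
    parts = []
    for event, node in events:
        # The parser may deliver text in several chunks, so collect them all.
        if event == pulldom.CHARACTERS:
            parts.append(node.data)
    return "".join(parts)


text = collect_text(xml_utf8)
```

Here text holds the decoded string, including the U+2019 apostrophe; it can be encoded explicitly (for example text.encode("utf-8")) before being written to a terminal or file that is not Unicode-aware.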