>> Parsing Unicode XML strings isn't quite that meaningful.
>
> Maybe not according to the XML standard, but I can see lots of
> practical situations where the encoding is always known and applied by
> some other layer, i.e. the I/O library or a database wrapper. Forcing
> XML to be interpreted as binary isn't always the best idea. E.g.
> consider storing XML in a SVN repository. Or consider storing XML
> fragments in Python string literals.

Stefan got it right - a "higher-level protocol" may override the encoding
declaration in the XML data. In the case of Python Unicode strings, the
data is 16-bit Unicode (or 32-bit), "obviously" overriding the declared
encoding (although technically, that protocol needs to explicitly state
what encoding takes precedence).

So let me rephrase: "Parsing Unicode XML strings may easily lead to
parsing problems" (i.e. if the parser hasn't been told that a higher-layer
protocol was in place). This is currently the case in 3.0:

py> d=xml.dom.minidom.parseString("<?xml version='1.0' encoding='iso-8859-1'?><hallo>\u20ac</hallo>")
py> d.documentElement.childNodes[0].data
'â\x82¬'
py> list(map(ord,d.documentElement.childNodes[0].data))
[226, 130, 172]

Regards,
Martin
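
P.S. A minimal sketch of what seems to be going on in the session above: the three code points 226, 130, 172 are exactly the UTF-8 bytes of U+20AC read back as ISO-8859-1, one character per byte. The names EURO and text_of are made up for illustration, and the sketch deliberately feeds the parser bytes rather than a Unicode string, since how a str argument is treated may depend on the Python version.

```python
import xml.dom.minidom

EURO = "\u20ac"  # the euro sign used in the session above


def text_of(doc):
    """Return the character data of the document element's first child."""
    return doc.documentElement.childNodes[0].data


# The mechanism in isolation: the euro sign is three bytes in UTF-8, and
# ISO-8859-1 maps every byte to exactly one character.
mojibake = EURO.encode("utf-8").decode("iso-8859-1")
print(list(map(ord, mojibake)))  # [226, 130, 172]

# Same effect through the parser: the payload is UTF-8, but the declaration
# claims ISO-8859-1, and nothing tells the parser otherwise.
mismatched = (
    b"<?xml version='1.0' encoding='iso-8859-1'?>"
    b"<hallo>\xe2\x82\xac</hallo>"
)
garbled = text_of(xml.dom.minidom.parseString(mismatched))

# One way to stay out of trouble: encode explicitly and make the declaration
# agree with the bytes that are actually handed to the parser.
consistent = (
    "<?xml version='1.0' encoding='utf-8'?><hallo>" + EURO + "</hallo>"
).encode("utf-8")
clean = text_of(xml.dom.minidom.parseString(consistent))
```

With matching bytes and declaration no higher-level protocol is needed at all; the override question only arises when the caller already holds decoded text.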