Re: [Python-3000] XML as bytes or unicode?

Martin v. Löwis Sun, 07 Sep 2008 09:02:08 -0700

>> Parsing Unicode XML strings isn't quite that meaningful.
> 
> Maybe not according to the XML standard, but I can see lots of
> practical situations where the encoding is always known and applied by
> some other layer, i.e. the I/O library or a database wrapper. Forcing
> XML to be interpreted as binary isn't always the best idea. E.g.
> consider storing XML in a SVN repository. Or consider storing XML
> fragments in Python string literals.


Stefan got it right - a "higher-level protocol" may override the
encoding declaration in the XML data. In the case of Python Unicode
strings, the data is 16-bit Unicode (or 32-bit), "obviously" overriding
the declared encoding (although technically, that protocol needs to
explicitly state what encoding takes precedence).

So let me rephrase: "Parsing Unicode XML strings may easily lead
to parsing problems" (i.e. if the parser hasn't been told that a
higher-layer protocol was in place). This is currently the case in 3.0:

py> import xml.dom.minidom
py> d = xml.dom.minidom.parseString("<?xml version='1.0' encoding='iso-8859-1'?><hallo>\u20ac</hallo>")
py> d.documentElement.childNodes[0].data
'â\x82¬'
py> list(map(ord,d.documentElement.childNodes[0].data))
[226, 130, 172]
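
The three code points in that output are exactly the UTF-8 bytes of U+20AC read back as iso-8859-1. The sketch below reproduces that by hand, on the assumption (the session shows only the symptom, not the cause) that the string reaches the parser as UTF-8 bytes and the declared encoding is then applied to them. It also parses the same document as bytes whose declaration matches the actual encoding, where no higher-level protocol is needed. All variable names are illustrative only.

```python
import xml.dom.minidom

euro = "\u20ac"

# Assumed mechanism: the str is turned into UTF-8 bytes, and those bytes
# are then decoded using the declared encoding (iso-8859-1).
utf8_bytes = euro.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")
print(list(map(ord, misread)))  # [226, 130, 172]

# Passing bytes whose declaration matches the real encoding avoids the
# ambiguity: the parser decodes the data itself.
doc_bytes = (
    "<?xml version='1.0' encoding='utf-8'?><hallo>" + euro + "</hallo>"
).encode("utf-8")
d = xml.dom.minidom.parseString(doc_bytes)
text = d.documentElement.childNodes[0].data
print(text == euro)  # True
```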

Regards,
Martin