Guido van Rossum wrote:
> 2008/8/24 "Martin v. Löwis" <[EMAIL PROTECTED]>:
>> Parsing Unicode XML strings isn't quite that meaningful.
>
> Maybe not according to the XML standard, but I can see lots of
> practical situations where the encoding is always known and applied by
> some other layer, i.e. the I/O library or a database wrapper. Forcing
> XML to be interpreted as binary isn't always the best idea. E.g.
> consider storing XML in a SVN repository. Or consider storing XML
> fragments in Python string literals.

lxml handles XML data in unicode strings nicely. The reasoning is that
the XML spec says in 4.3.3:

"""
In the absence of information provided by an external transport protocol
(e.g. HTTP or MIME), it is a fatal error for an entity including an
encoding declaration to be presented to the XML processor in an encoding
other than that named in the declaration [...]
"""

On a given platform, the internal encoding of a Python unicode string is
well defined, which means it is as good as an encoding provided by a
transport protocol. So this works as long as the XML content of the
unicode string does not specify a wrong encoding itself (in which case
the parser must reject it).

Another reason why lxml handles this is that it also has great support
for HTML. In the HTML world, unicode data is a lot easier to handle than
the average byte encoded page that doesn't provide any encoding
information at all.

Stefan
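
To illustrate the "encoding applied by some other layer" case, here is a minimal sketch. It assumes Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in, not lxml itself, so it says nothing about how lxml treats encoding declarations. The sample fragment and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: Latin-1 bytes with no XML declaration, e.g. as
# returned by a database wrapper that knows the column encoding.
raw = '<greeting lang="de">Grüße</greeting>'.encode("latin-1")

# Parsed as bytes, the parser has to guess. Without a declaration it
# assumes UTF-8, and these bytes are not valid UTF-8.
try:
    ET.fromstring(raw)
    bytes_parse_failed = False
except ET.ParseError:
    bytes_parse_failed = True

# The outer layer knows the encoding, so it decodes first and hands the
# parser a unicode string whose encoding is no longer in question.
text = raw.decode("latin-1")
root = ET.fromstring(text)

print(bytes_parse_failed, root.tag, root.get("lang"), root.text)
```

The same applies to an XML fragment kept in a Python string literal: it is already unicode, so there is nothing left for the parser to guess.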