Stefan Behnel <sco...@users.sourceforge.net> added the comment:

Hi Guido, your comment was long overdue in this discussion.

Guido van Rossum, 12.03.2010 01:35:
> My thinking was that since an XML document looks like text, it should
> probably be considered text, at least by default.  (There may have
> been some unittests that appeared to require this -- of course this
> was probably just the confusion between byte strings and 8-bit text
> strings inherent in Python 2.)

Well, well, XML...

It does look like text, but it's encoded text that is defined as a stream of
bytes, and a byte stream is the only safe way of dealing with it.

There certainly *is* a use case for treating the serialised result as text,
which is why lxml has this feature. A minor one is debug output (which certainly
doesn't merit being the default), but another one is dealing with HTML, where
encoding information is certainly less well defined and *much* less often seen
in the wild. So users tend to be happy when their real-world HTML input gets
fixed up into proper Unicode, and happier still when they see that lxml can
parse it correctly and even serialise the result straight back into a Unicode
string, which they can post-process as text if they need to.

However, the main part here is the input, i.e. getting HTML data properly 
decoded into Unicode. The output part is a lot less important, and it's often 
easier to let lxml.html do the correct serialisation into bytes with proper 
encoding meta information, rather than dealing with it yourself.
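
For illustration, a small sketch of that round trip, assuming lxml is installed
(Python 3 spelling; on Py2 the text serialisation would use encoding=unicode):

    from lxml import html

    # Real-world HTML that has already been fixed up into a proper unicode string:
    page = '<html><body><p>Caf\u00e9 &amp; bar</p></body></html>'

    doc = html.fromstring(page)            # lxml.html parses unicode strings directly

    # Serialise back into a unicode string for text post-processing ...
    text_result = html.tostring(doc, encoding=str)

    # ... or hand the encoding problem back to lxml and get encoded bytes:
    byte_result = html.tostring(doc, encoding='utf-8')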

Those are the two use cases I see for lxml. Their impact on ElementTree is 
relatively low, as it doesn't support *parsing* from a Unicode string, so the 
most important HTML feature isn't there in the first place. The lack of major 
use cases in ElementTree is one of the reasons I'm so opposed to making this 
feature the backwards-incompatible default for the output side.


> Regarding backwards compatibility, there are now two backwards
> compatibility problems: with 2.x, and with 3.1.  It seems we cannot
> easily be backwards compatible with both (though if someone figures
> out a way that would be best of course).
> 
> If I were to propose an API for returning a Unicode string, I would
> probably add a new method (e.g. tounicode()) rather than using a
> "magical" argument (tostring(encoding=str)), but given that that
> exists in another supposedly-compatible implementation I'm not
> against it.

Actually, lxml.etree originally had a tounicode() function for this purpose, 
and I deprecated it in favour of tostring(encoding=unicode) to avoid having a 
separate interface for this, while staying just as explicit as before.  I'm 
aware that this wasn't an all-win decision, but I found passing the unicode 
type to be explicit enough, and separate enough from an encoding /name/ to make 
it clear what happens. It's certainly less beautiful in Py3, where you write 
"tostring(encoding=str)".

I still haven't removed the function from the API, but it's been deprecated for 
years. Reactivating it in lxml.etree, and duplicating it in ET, would save 
lxml.etree from having to break user code (as "tostring(encoding=str)" could 
simply continue to work, but disappear from the docs). It wouldn't save ET in 
Py3 from breaking backwards compatibility with itself, though.


> Maybe tostring(encoding=None) could also be made to work? That would
> at least make it *possible* to write code that receives a text object
> and that works in 3.1 and 3.2 both.  In 2.x I think neither of these
> should work, and there probably isn't a need -- apps needing full
> compatibility will just have to refrain from calling tostring()
> without arguments.

It could be made to work, and it doesn't even read that badly. I can't imagine 
anyone using this explicitly to get the default behaviour, although you never 
know how people put together their keyword argument dicts programmatically. 
'None' has always been the documented default for the encoding parameter, so 
I'm sure there's at least a tiny bit of code that uses it to say "I'm not 
overriding the default here".

Actually, the encoding has been a keyword-only parameter in lxml.etree for 
ages, which was fine with the original default and conforms to the official ET 
documentation. So it would be easy to switch here, although not beautiful in 
the implementation. The same goes for ElementTree, where the current default 
None in the signature could simply be replaced by the 'real' default 
'us-ascii'. Within the Py3 series, this change would not preserve backwards 
compatibility either.
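
To make the trade-off concrete, here is a minimal illustrative sketch (not the
actual ElementTree or lxml code; _serialise_text() and _serialise_bytes() are
placeholders for the real serialiser) of what spelling out the 'real' default
would allow, and where the compatibility hazard sits:

    # Illustrative sketch only -- not the real ElementTree/lxml implementation.
    def tostring(element, *, encoding='us-ascii'):
        if encoding is None:
            # With 'us-ascii' as the spelled-out default, an *explicit*
            # encoding=None could be repurposed to mean "give me text" -- but
            # any existing code that passes None to mean "use the default"
            # would silently change behaviour.
            return _serialise_text(element)
        return _serialise_bytes(element, encoding)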

So, as a solution, I prefer separating this feature out into a separate 
function, so that we can simplify the interface of tostring() to always return 
a byte string serialisation, as it has always done in ET. The rather distinct 
use case of serialising to an unencoded text string can perfectly well be 
handled by a tounicode() function.
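
In code, the proposed split could look roughly like this (a hypothetical
sketch, not an agreed-upon API; the naive version just decodes the default
ASCII byte output):

    # Hypothetical sketch of the proposed split -- not an agreed-upon API.
    from xml.etree import ElementTree as ET

    def tounicode(element):
        """Serialise an element to an unencoded text string."""
        # Naive approach: the default serialisation is pure ASCII (non-ASCII
        # characters become character references), so decoding it is safe.
        return ET.tostring(element).decode('ascii')

    # tostring() itself keeps returning bytes, as it always has in ET:
    elem = ET.Element('root')
    data = ET.tostring(elem)    # bytes
    text = tounicode(elem)      # text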


> ISTM that the behavior of write() is just fine -- the contents of the
> file will be correct after all.

Not according to the Py3.2 dev docs of open():

"""
'encoding' is the name of the encoding used to decode or encode the file. This 
should only be used in text mode. The default encoding is platform dependent 
(whatever locale.getpreferredencoding() returns)
"""

So if a user's "preferred encoding" is not UTF-8 compatible, then writing out 
the Unicode serialisation will result in an incorrect XML serialisation, as an 
XML byte stream without an encoding declaration is assumed to be UTF-8 by the 
specification.
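
A short sketch of the hazard (and two ways around it), assuming a platform
where locale.getpreferredencoding() is not UTF-8; the commented-out
text_serialisation is a placeholder for any text-mode serialisation:

    from xml.etree import ElementTree as ET

    root = ET.Element('data')
    ET.SubElement(root, 'item').text = 'Bj\u00f6rk'

    # Hazard: text mode without an explicit encoding uses the platform default.
    # If that is e.g. Latin-1, the file contains non-UTF-8 bytes but no encoding
    # declaration, so conforming XML parsers will assume UTF-8 and misread it.
    #
    #   with open('out.xml', 'w') as f:        # encoding is platform dependent
    #       f.write(text_serialisation)
    #
    # Safe alternatives: declare the text-mode encoding explicitly ...
    with open('out.xml', 'w', encoding='utf-8') as f:
        f.write(ET.tostring(root).decode('ascii'))

    # ... or let ElementTree write encoded bytes with a matching declaration.
    ET.ElementTree(root).write('out2.xml', encoding='iso-8859-1',
                               xml_declaration=True)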

Stefan
