Ian Bicking, 26.06.2010 00:26:
On Fri, Jun 25, 2010 at 4:02 PM, Guido van Rossum wrote:
On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz wrote:
I'd like a version of 'decode' which would give me a type that was, in every
respect, unicode, and responded to all protocols exactly as other unicode
objects (or "str objects", if you prefer py3 nomenclature ;-)) do, but
wouldn't actually copy any of that memory unless it really needed to (for
example, to pass to a C API that expected native wide characters), and that
would hold on to the original bytes so that it could produce them on demand
if encoded to the same encoding again. So, as others in this thread have
mentioned, the 'ABC' really implies some stuff about C APIs as well.
Well, there's the buffer API, so you can already create something that
refers to an existing C buffer. However, with respect to a string, you will
have to make sure the underlying buffer doesn't get freed while the string
is still in use. That will be hard and sometimes impossible to do at the
C-API level, even if the string is allowed to keep a reference to something
that holds the buffer.
At least in lxml, such a feature would be completely worthless, as text is
never held by any ref-counted Python wrapper object. It's only part of the
XML tree, which is allowed to change at (more or less) any time, so the
underlying char* buffer could just get freed without further notice. Adding
a guard against that would likely have a larger impact on the performance
than the decoding operations.
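
To illustrate the lifetime problem at the Python level, here is a minimal
sketch using memoryview over a bytearray. This is not lxml or libxml2 code;
the variable names are made up for the example. In CPython, a bytearray
refuses to be resized while a buffer view onto it exists, which is the kind
of guard a raw char* in a C tree does not give you for free.

```python
buf = bytearray(b"<root>text</root>")
base = memoryview(buf)
view = base[6:10]                  # zero-copy window onto the bytes "text"

try:
    buf.extend(b"<!-- more -->")   # resizing could move the underlying memory
except BufferError:
    resized_while_exported = False  # refused: an exported view still exists
else:
    resized_while_exported = True

text = view.tobytes().decode("utf-8")  # the copy happens only here

# Once every view is released, the owner may change its buffer again.
view.release()
base.release()
buf.extend(b"<!-- more -->")
```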
I'm not sure about the exact performance impact of such a class, which is why
I'd like the ability to implement it *outside* of the stdlib and see how it
works on a project, and return with a proposal along with some data. There
are also different ways to implement this, and other optimizations (like
ropes) which might be better.
You can almost do this today, but the lack of things like the hypothetical
"__rcontains__" does make it impossible to be totally transparent about it.
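
A rough sketch of such a wrapper, written for Python 3, might look like the
following. The class name LazyText and its methods are hypothetical and only
cover a tiny part of the string protocol. It keeps the original bytes, decodes
on first use, and returns the original bytes when asked for the same encoding.
The last lines show the transparency gap: "x in wrapper" can be customised,
but "wrapper in some_str" is decided by str alone, because there is no
reflected counterpart to __contains__.

```python
import codecs


class LazyText:
    """Hypothetical wrapper: keeps the original bytes, decodes on demand."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = bytes(raw)
        # Normalise the codec name so "UTF8" and "utf-8" compare equal.
        self._encoding = codecs.lookup(encoding).name
        self._decoded = None  # filled in lazily

    def _text(self):
        if self._decoded is None:
            # Decoding also validates the bytes; invalid input raises here,
            # i.e. later than an eager decode() would have raised.
            self._decoded = self._raw.decode(self._encoding)
        return self._decoded

    def encode(self, encoding="utf-8"):
        # Same encoding as the source: hand back the original bytes.
        if codecs.lookup(encoding).name == self._encoding:
            return self._raw
        return self._text().encode(encoding)

    def __str__(self):
        return self._text()

    def __contains__(self, item):
        # Handles "item in lazy_text".
        return str(item) in self._text()


lazy = LazyText(b"caf\xc3\xa9")
same_bytes = lazy.encode("UTF8") is lazy._raw   # no re-encoding needed
found = "af" in lazy                            # works via __contains__

try:
    lazy in "a caf\u00e9"   # str.__contains__ rejects the foreign type
except TypeError:
    transparent = False
else:
    transparent = True
```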
But you'd still have to validate it, right? You wouldn't want to go on
using what you thought was wrapped UTF-8 if it wasn't actually valid
UTF-8 (or you'd be worse off than in Python 2). So you're really just
worried about space consumption. I'd like to see a lot of hard memory
profiling data before I got overly worried about that.
It wasn't my profiling, but I seem to recall that Fredrik Lundh specifically
benchmarked ElementTree with all-unicode and sometimes-ascii-bytes, and
found that using Python 2 strs in some cases provided notable advantages. I
know Stefan copied ElementTree in this regard in lxml, maybe he also did a
benchmark or knows of one?
Actually, bytes vs. unicode doesn't make that big a difference in Py2 for
lxml. ElementTree is a lot older, so I guess it made a larger difference
when its code was written (and I even think I recall seeing numbers for
lxml where it seemed to make a notable difference).
In lxml, text content is stored in the C tree of libxml2 as UTF-8 encoded
char* text. On request, lxml creates a string object from it and returns
it. In Py2, it checks for plain ASCII content first and returns a byte
string for that. Only non-ASCII strings are returned as decoded unicode
strings. In Py3, it always returns unicode strings.
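
The Py2 behaviour described above can be sketched roughly like this. It is an
illustration written for Python 3.7+, not lxml's actual implementation, and
the function name text_from_utf8 is invented for the example.

```python
def text_from_utf8(raw):
    """Hypothetical helper mimicking the Py2 return-type rule."""
    if raw.isascii():            # bytes.isascii() requires Python 3.7+
        return raw               # plain ASCII: return the byte string as-is
    return raw.decode("utf-8")   # non-ASCII: return a decoded unicode string


ascii_result = text_from_utf8(b"id42")
unicode_result = text_from_utf8(b"caf\xc3\xa9")
```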
When I run a little benchmark on lxml in Py2.6.5 that just reads some short
text content from an Element object, I only see a tiny difference between
unicode strings and byte strings. The gap obviously increases when the text
gets longer, e.g. when I serialise the complete text content of an XML
document to either a byte string or a unicode string. But even for
documents in the megabyte range we are still talking about single
milliseconds here, and the difference stays well below 10%. It's seriously
hard to make that the performance bottleneck in an XML application.
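
For anyone who wants to get a feeling for the order of magnitude, a minimal
timing sketch along these lines could be used. It does not involve lxml at
all: it only compares copying about 1 MB of ASCII data into a byte string
with decoding the same data into a unicode string, so the numbers are only a
rough stand-in for the serialisation case above and will vary by machine and
Python version.

```python
import timeit

# Roughly 1 MB of ASCII-only UTF-8 data, standing in for serialised XML text.
source = bytearray(b"some plain ASCII text content " * 35000)


def as_bytes():
    return bytes(source)            # copy into a byte string


def as_unicode():
    return source.decode("utf-8")   # decode into a unicode string


runs = 20
bytes_time = timeit.timeit(as_bytes, number=runs) / runs
unicode_time = timeit.timeit(as_unicode, number=runs) / runs
print("bytes:   %.3f ms per call" % (bytes_time * 1000))
print("unicode: %.3f ms per call" % (unicode_time * 1000))
```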
Also, since the string objects are only instantiated on request, memory
isn't an issue either. That's different for (c)ElementTree again, where
string content is stored as Python objects. Four times the size even for
plain ASCII strings (e.g. numbers, IDs or even trailing whitespace!) can
well become a problem there, and can easily dominate the overall size of
the in-memory tree. Plain ASCII content is surprisingly common in XML
documents.
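
The factor of four can be reproduced with a small calculation, assuming a
wide (UCS-4) Py2 build where each unicode character takes four bytes. The
sketch below only compares encoded payload sizes and ignores per-object
overhead; the sample text is made up.

```python
text = u"12345 some-id   "   # plain ASCII content, as is common in XML

utf8_size = len(text.encode("utf-8"))      # 1 byte per ASCII character
ucs4_size = len(text.encode("utf-32-le"))  # 4 bytes per character, no BOM

ratio = ucs4_size // utf8_size
print(utf8_size, ucs4_size, ratio)
```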
Stefan