Re: [Python-Dev] thoughts on the bytes/string discussion

Stefan Behnel Sat, 26 Jun 2010 02:36:56 -0700

Ian Bicking, 26.06.2010 00:26:

On Fri, Jun 25, 2010 at 4:02 PM, Guido van Rossum wrote:

On Fri, Jun 25, 2010 at 1:43 PM, Glyph Lefkowitz

I'd like a version of 'decode' which would give me a type that was, in
every respect, unicode, and responded to all protocols exactly as other
unicode objects (or "str objects", if you prefer py3 nomenclature ;-))
do, but wouldn't actually copy any of that memory unless it really needed
to (for example, to pass to a C API that expected native wide characters),
and that would hold on to the original bytes so that it could produce them
on demand if encoded to the same encoding again. So, as others in this
thread have mentioned, the 'ABC' really implies some stuff about C APIs as
well.

Well, there's the buffer API, so you can already create something that
refers to an existing C buffer. However, with respect to a string, you will
have to make sure the underlying buffer doesn't get freed while the string
is still in use. That will be hard and sometimes impossible to do at the
C-API level, even if the string is allowed to keep a reference to something
that holds the buffer.
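
As a rough Python-level illustration of the lifetime problem described
above (not the C-API situation itself): a memoryview refers to an existing
buffer without copying, and CPython refuses to resize the underlying
bytearray while that view is alive. The bytearray here is only a stand-in
for a C char* buffer, and the names are made up for the example.

```python
# Stand-in for text owned by some other component (hypothetical data).
buf = bytearray(b"<root>text</root>")

# Zero-copy reference to the existing buffer.
view = memoryview(buf)

# While the view exists, the owner may not resize (and so possibly move
# or free) the memory: CPython raises BufferError.
try:
    buf.extend(b"!")
except BufferError:
    resize_blocked = True
else:
    resize_blocked = False

# Copy and decode only the slice that is actually requested.
text = view[6:10].tobytes().decode("utf-8")

# Once the view is released, the owner is free to change the buffer again.
view.release()
buf.extend(b"!")
```

A C library that owns its buffers offers no such guard by itself, which is
the difficulty raised above.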

At least in lxml, such a feature would be completely worthless, as text is
never held by any ref-counted Python wrapper object. It's only part of the
XML tree, which is allowed to change at (more or less) any time, so the
underlying char* buffer could just get freed without further notice. Adding
a guard against that would likely have a larger impact on the performance
than the decoding operations.

I'm not sure about the exact performance impact of such a class, which is
why I'd like the ability to implement it *outside* of the stdlib and see
how it works on a project, and return with a proposal along with some data.
There are also different ways to implement this, and other optimizations
(like ropes) which might be better. You can almost do this today, but the
lack of things like the hypothetical "__rcontains__" does make it
impossible to be totally transparent about it.

But you'd still have to validate it, right? You wouldn't want to go on
using what you thought was wrapped UTF-8 if it wasn't actually valid
UTF-8 (or you'd be worse off than in Python 2). So you're really just
worried about space consumption. I'd like to see a lot of hard memory
profiling data before I got overly worried about that.
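
The wrapper discussed above, with the up-front validation asked for here,
could be sketched roughly as follows. The class name LazyText and its
methods are hypothetical and not part of any proposal in this thread.
Validation is done by decoding once and discarding the result, so the time
is still spent and only the memory is saved.

```python
import codecs


class LazyText:
    """Hypothetical wrapper: keeps the original bytes, decodes on demand."""

    def __init__(self, raw, encoding="utf-8"):
        # Validate eagerly; raises UnicodeDecodeError on invalid input.
        # The decoded result is thrown away again.
        raw.decode(encoding)
        self._raw = raw
        # Normalise the codec name so "UTF8" and "utf-8" compare equal.
        self._encoding = codecs.lookup(encoding).name
        self._text = None

    def __str__(self):
        # Decode (and copy) only when the text is really needed.
        if self._text is None:
            self._text = self._raw.decode(self._encoding)
        return self._text

    def encode(self, encoding="utf-8"):
        # Same encoding as the source: hand back the original bytes.
        if codecs.lookup(encoding).name == self._encoding:
            return self._raw
        return str(self).encode(encoding)
```

This covers only str() and encode(); making such an object behave like a
real string everywhere is the part that remains hard.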


It wasn't my profiling, but I seem to recall that Fredrik Lundh specifically
benchmarked ElementTree with all-unicode and sometimes-ascii-bytes, and
found that using Python 2 strs in some cases provided notable advantages.  I
know Stefan copied ElementTree in this regard in lxml, maybe he also did a
benchmark or knows of one?

Actually, bytes vs. unicode doesn't make that big a difference in Py2 for
lxml. ElementTree is a lot older, so I guess it made a larger difference
when its code was written (and I even think I recall seeing numbers for
lxml where it seemed to make a notable difference).

In lxml, text content is stored in the C tree of libxml2 as UTF-8 encoded
char* text. On request, lxml creates a string object from it and returns
it. In Py2, it checks for plain ASCII content first and returns a byte
string for that. Only non-ASCII strings are returned as decoded unicode
strings. In Py3, it always returns unicode strings.
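
The Py2 behaviour described above amounts to something like the following
sketch. This is not lxml's actual code; the function name is made up, and
it is written in Py3 syntax, where returning bytes stands in for the Py2
byte string.

```python
def text_from_utf8(raw):
    """Return tree text the way the Py2 behaviour above is described.

    `raw` is the UTF-8 encoded content as bytes (hypothetical helper).
    """
    # Plain ASCII: hand back a byte string without decoding.
    # bytes.isascii() needs Python 3.7 or later.
    if raw.isascii():
        return raw
    # Anything else is decoded into a unicode string.
    return raw.decode("utf-8")
```

For example, text_from_utf8(b"42") gives back b"42" unchanged, while the
UTF-8 bytes of a non-ASCII word come back as a decoded string.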

When I run a little benchmark on lxml in Py2.6.5 that just reads some short
text content from an Element object, I only see a tiny difference between
unicode strings and byte strings. The gap obviously increases when the text
gets longer, e.g. when I serialise the complete text content of an XML
document to either a byte string or a unicode string. But even for
documents in the megabyte range we are still talking about single
milliseconds here, and the difference stays well below 10%. It's seriously
hard to make that the performance bottleneck in an XML application.
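
A comparison of this kind can be approximated without lxml. The sketch
below is not the benchmark mentioned above: it only times copying versus
decoding about 1 MB of ASCII data held in a bytearray, which stands in for
the C-level buffer. Absolute numbers will depend on the machine and the
Python version.

```python
import timeit

# Stand-in for UTF-8 text held in a C-level buffer (1,000,000 bytes).
buf = bytearray(b"plain ASCII text content " * 40000)


def as_byte_string():
    # Copy the buffer into a new byte string.
    return bytes(buf)


def as_unicode_string():
    # Validate, decode and copy into a new unicode string.
    return buf.decode("utf-8")


runs = 50
t_bytes = timeit.timeit(as_byte_string, number=runs) / runs
t_text = timeit.timeit(as_unicode_string, number=runs) / runs
print(f"bytes: {t_bytes * 1e3:.3f} ms   unicode: {t_text * 1e3:.3f} ms")
```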

Also, since the string objects are only instantiated at request, memory
isn't an issue either. That's different for (c)ElementTree again, where
string content is stored as Python objects. Four times the size even for
plain ASCII strings (e.g. numbers, IDs or even trailing whitespace!) can
well become a problem there, and can easily dominate the overall size of
the in-memory tree. Plain ASCII content is surprisingly common in XML
documents.
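
The "four times" figure is simple arithmetic if unicode text is stored at
four bytes per character, which is what the figure implies. The sketch
below uses the UTF-32 codec purely as a stand-in for such a representation
and ignores per-object overhead; the sample strings are made up.

```python
# Typical plain ASCII content: a number, an ID, trailing whitespace.
samples = ["42", "id-00017", "\n    "]

ratios = []
for s in samples:
    compact = len(s.encode("utf-8"))    # one byte per ASCII character
    wide = len(s.encode("utf-32-le"))   # four bytes per character, no BOM
    ratios.append(wide // compact)
    print(repr(s), compact, wide)
```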


Stefan
