On Nov 23, 2010, at 7:22 PM, James Y Knight wrote:

> On Nov 23, 2010, at 6:49 PM, Greg Ewing wrote:
>> Maybe Python should have used UTF-8 as its internal unicode
>> representation. Then people who were foolish enough to assume
>> one character per string item would have their programs break
>> rather soon under only light unicode testing. :-)
> 
> You put a smiley, but, in all seriousness, I think that's actually the right 
> thing to do if anyone writes a new programming language. It is clearly the 
> right thing if you don't have to be concerned with backwards-compatibility: 
> nobody really needs to be able to access the Nth codepoint in a string in 
> constant time, so there's not really any point in storing a vector of 
> codepoints.
> 
> Instead, provide bidirectional iterators which can traverse the string by 
> byte, codepoint, or by grapheme (that is: the set of combining characters + 
> base character that go together, making up one thing which a human would 
> think of as a character).


I really hope that this idea is not just for new programming languages.  If you 
switch from doing unicode "wrong" to doing unicode "right" in Python, you 
quadruple the memory footprint of programs which primarily store and manipulate 
large amounts of text.
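
To make the factor of four concrete, here is a rough sketch, not a measurement
of any real application.  The sample text and variable names are made up for
illustration.  It compares encoded sizes only; the real in-memory size of a
Python string also depends on the build and on per-object overhead.

```python
# Illustrative only: encoded size of the same ASCII-heavy text in two encodings.
text = "hello world " * 1000  # hypothetical sample: 12,000 ASCII characters

utf8_size = len(text.encode("utf-8"))
# "utf-32-le" is used so that no 4-byte byte-order mark is prepended.
utf32_size = len(text.encode("utf-32-le"))

ratio = utf32_size / utf8_size
print(utf8_size, utf32_size, ratio)
```

For text that is mostly non-ASCII the ratio is smaller, since UTF-8 then needs
two to four bytes per code point.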

This is especially ridiculous in PyGTK applications, where the internal 
representation required by the GUI is UTF-8 anyway, so the round-tripping of 
string data back and forth to the exploded UTF-32 representation is wasting 
gobs of memory and time.  It at least makes sense when your C library's idea 
about character width and your Python build match up.

Still, in a desktop app this is unlikely to be a performance concern; in 
servers, it's a big deal, and measurably so.  I am pretty sure that in the server apps that 
I work on, we are eventually going to need our own string type and UTF-8 logic 
that does exactly what James suggested - certainly if we ever hope to support 
Py3.
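
As a minimal sketch of what such UTF-8 logic might look like (the function
names are hypothetical, and this is not code from any existing project), the
following walks a UTF-8 byte string by code point in either direction without
decoding the whole thing first.  It assumes well-formed input.  Grapheme
iteration is left out, since it needs Unicode segmentation data on top of this.

```python
def iter_codepoints(data: bytes):
    """Yield (byte_offset, character) pairs, front to back, from UTF-8 bytes."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        # The lead byte's high bits give the length of the encoded sequence.
        if lead < 0x80:
            length = 1
        elif lead >> 5 == 0b110:
            length = 2
        elif lead >> 4 == 0b1110:
            length = 3
        elif lead >> 3 == 0b11110:
            length = 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
        yield i, data[i:i + length].decode("utf-8")
        i += length


def iter_codepoints_reversed(data: bytes):
    """Yield (byte_offset, character) pairs, back to front, from UTF-8 bytes."""
    end = len(data)
    while end > 0:
        start = end - 1
        # Continuation bytes look like 0b10xxxxxx; skip back to the lead byte.
        while start > 0 and (data[start] & 0xC0) == 0x80:
            start -= 1
        yield start, data[start:end].decode("utf-8")
        end = start


# One character each of 1, 2, 3 and 4 encoded bytes.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
forward = list(iter_codepoints(sample))
backward = list(iter_codepoints_reversed(sample))
```

Each step costs time proportional to the width of one character, which is why
constant-time indexing by code point is not needed for this style of traversal.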

(I dimly recall that both James and I have made this point before, but it's 
pretty important, so it bears repeating.)

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
