Glenn Linderman writes:

 > I found your discussion of streams versus arrays, as separate concepts
 > related to Unicode, along with Terry's bisect indexing implementation,
 > to be rather inspiring. Just because Unicode defines streams of code
 > units of various sizes (UTF-8, UTF-16, UTF-32) to represent characters
 > when processes communicate and for storage (which is one way processes
 > communicate), that doesn't imply that the internal representation of
 > character strings in a programming language must use exactly that
 > representation.
That is true, and Unicode is *very* careful to define its requirements so
that this is the case. That doesn't mean using an alternative
representation is an improvement, though.

 > I'm unaware of any current Python implementation that has chosen to
 > use UTF-8 as the internal representation of character strings (I'm
 > aware that Perl has made that choice), yet UTF-8 is one of the
 > commonly recommended character representations on the Linux platform,
 > from what I read.

There are two reasons for that. First, widechar representations are right
out for anything related to the file system or the OS, unless you are
prepared to translate before passing data to the OS. If you use UTF-8,
then asking the user to use a UTF-8 locale to communicate with your app is
a plausible way to eliminate any translation in your app. (The original
moniker for UTF-8 was UTF-FSS, where FSS stands for "file system safe.")

Second, much text processing is stream-oriented and one-pass. In those
cases, the variable-width nature of UTF-8 doesn't cost you anything.
E.g., this is why the common GUIs for Unix (X.org, GTK+, and Qt) either
provide or require UTF-8 coding for their text: it costs *them* nothing
and is file-system-safe.

 > So in that sense, Python has rejected the idea of using the
 > "native" or "OS configured" representation as its internal
 > representation.

I can't agree with that characterization. POSIX defines the concept of
*locale* precisely because the "native" representation of text in Unix is
ASCII. Obviously that won't fly, so they solved the problem in the worst
possible way<wink/>: they made the representation variable! It is the
*variability* of text representation that Python rejects, just as Emacs
and Perl do. They happen to have chosen six different representations
among them.[1]

 > So why, then, must one choose from a repertoire of Unicode-defined
 > stream representations if they don't meet the goal of efficient
 > length, indexing, or slicing operations on actual characters?

One need not. But why do anything else? It's not as if the authors of
that standard paid no attention to concerns about efficiency and backward
compatibility! That's the question you have not answered, and I presently
lack any data suggesting I'll ever need the facilities you propose.

Footnotes:
[1] Emacs recently changed its mind: originally it used the so-called
MULE encoding, and it now uses an extension of UTF-8 different from
Perl's. Of course, Python beats that, with narrow, wide, and now PEP-393
representations!<wink />
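
For concreteness, here is a rough sketch of the trade-off under
discussion: with a UTF-8 byte buffer as the internal representation,
one-pass iteration costs nothing extra, while character-based length and
indexing need either an O(n) scan or an auxiliary offset table. The class
name Utf8String and the bisect-based table are my own illustration, not
Terry's implementation and not anything in CPython:

    from bisect import bisect_right

    class Utf8String:
        """Toy string type over a UTF-8 buffer (illustration only)."""

        def __init__(self, text):
            self.buf = text.encode("utf-8")
            # Byte offsets at which characters start; a UTF-8 continuation
            # byte always matches 0b10xxxxxx, so skip those.
            self.starts = [i for i, b in enumerate(self.buf)
                           if b & 0xC0 != 0x80]

        def __len__(self):
            # O(1) with the table; O(n) if you had to scan the bytes.
            return len(self.starts)

        def __iter__(self):
            # One-pass, stream-style access: the variable width costs nothing.
            return iter(self.buf.decode("utf-8"))

        def __getitem__(self, i):
            # Character indexing via the offset table.
            start = self.starts[i]
            end = (self.starts[i + 1] if i + 1 < len(self.starts)
                   else len(self.buf))
            return self.buf[start:end].decode("utf-8")

        def char_index(self, byte_offset):
            # Map a byte offset (say, from a search over self.buf) back to
            # a character index by bisecting the offset table.
            return bisect_right(self.starts, byte_offset) - 1

    s = Utf8String("naïve café")
    print(len(s), s[2], s.char_index(10))   # -> 10 ï 9

None of this is meant as a proposal; it is only to make the cost model
concrete: character-level operations on a UTF-8 buffer need a scan or a
side table, while the stream-style loop does not care about widths at all.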