On 9/20/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> On 9/20/06, Guido van Rossum <[EMAIL PROTECTED]> wrote:
> > On 9/20/06, Adam Olsen <[EMAIL PROTECTED]> wrote:
> > > Before we can decide on the internal representation of our unicode
> > > objects, we need to decide on their external interface.  My thoughts
> > > so far:
> >
> > Let me cut this short. The external string API in Py3k should not
> > change or only very marginally so (like removing rarely used useless
> > APIs or adding a few new conveniences). The plan is to keep the 2.x
> > API that is supported (in 2.x) by both str and unicode, but merge the
> > two string types into one. Anything else could be done just as easily
> > before or after Py3k.
>
> Thanks, but one thing remains unclear: is the indexing intended to
> represent bytes, code points, or code units?  Note that C code
> operating on UTF-16 would use code units for slicing of UTF-16, which
> splits surrogate pairs.
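
To make the code unit question above concrete, here is a minimal sketch in
Python. It assumes a current CPython (3.3 or later), where str is indexed by
code point; that is not something this thread had settled. The helper name
utf16_code_units and the sample string are made up for illustration.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


# U+1D11E (musical symbol G clef) lies outside the Basic Multilingual
# Plane, so UTF-16 needs a surrogate pair (two code units) to encode it.
sample = "a\U0001D11Eb"

units = utf16_code_units(sample)
print(len(sample))               # number of code points
print([hex(u) for u in units])   # code units, including the surrogate pair

# Slicing after the second code unit cuts the surrogate pair in half,
# so the resulting bytes are no longer valid UTF-16.
truncated = sample.encode("utf-16-le")[:4]
try:
    truncated.decode("utf-16-le")
except UnicodeDecodeError as exc:
    print("split surrogate pair:", exc.reason)
```

Indexing by code unit is what makes that last slice possible; indexing by
code point cannot land in the middle of a surrogate pair.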

Assuming my Unicode lingo is right and a code point represents a
letter/character/digraph/whatever, then indexing will be by code point.
Channeling Guido, which I rarely do, I *really* doubt he wants to expose
the technical details of Unicode to the point where people need to know
that UTF-8 takes two bytes to represent "ö".  If you want that kind of
exposure, use the bytes type.  Otherwise, assume that most users will be
unfamiliar with Unicode and will want something that works the way they
are used to from ASCII.
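
A small sketch of the difference between counting code points and counting
bytes, again assuming a current CPython 3 where str is indexed by code point
and bytes holds the encoded form. The helper name sizes is made up for
illustration. The last case is a caveat on the lingo: a code point is not
always a whole "character", since "ö" can also be spelled as "o" followed by
a combining diaeresis.

```python
import unicodedata


def sizes(text):
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


precomposed = "\u00f6"  # "ö" as the single code point U+00F6
decomposed = unicodedata.normalize("NFD", precomposed)  # "o" + U+0308

print(sizes("o"))          # plain ASCII letter
print(sizes(precomposed))  # one code point, more than one UTF-8 byte
print(sizes(decomposed))   # same visible character, two code points

# Indexing the str gives a code point; indexing the bytes gives one byte.
print(precomposed[0] == "\u00f6")
print(precomposed.encode("utf-8")[0])
```

Code that only ever indexes the str never sees the two-byte encoding; it
only shows up once the text is explicitly encoded to bytes.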

-Brett
_______________________________________________
Python-3000 mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-3000