Re: [Python-3000] Handling of wide Unicode characters

Guido van Rossum Fri, 01 Jun 2007 16:25:04 -0700

What he said. IOW, we're treating each half of a surrogate as a
"character", at least for purposes of counting items in a string.
(Otherwise operations like len() and indexing/slicing would no longer
be O(1).)
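
To make the counting concrete, here is a minimal sketch of what "each half of a surrogate counts as one item" means. It assumes a current Python 3 (3.3 or later), where str is indexed by code point, so the narrow-build view is simulated by encoding to UTF-16. The helper name utf16_code_units is made up for illustration, and the code point U+2F8CD is inferred from the surrogate pair quoted below.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")          # 2 bytes per code unit, no BOM
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# Assumption: this is the character from the test, inferred from \ud87e\udccd.
wide = "\U0002F8CD"
units = utf16_code_units(wide)

print(len(wide))                  # code points, as a modern Python 3 counts them
print(len(units))                 # code units, as a narrow build would count them
print([hex(u) for u in units])    # the two surrogate halves
```

Counting code units keeps len() and indexing a matter of simple arithmetic on a fixed-width array, which is the O(1) property mentioned above.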


--Guido

On 6/2/07, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
> "Alexandre Vassalotti" <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I was doing some testing on the new _string_io module, since I was
> > slightly skeptical of my handling of wide Unicode characters (32 bits
> > long, instead of the usual 16 bits in UTF-16). So, I ran this
> > little test:
> >
> >    >>> s = _string_io.StringIO()
> >    >>> s.write(u'晉')
> >    >>> s.tell()
> >    2
> >
> > As I expected, wide Unicode characters count for two. However, I was
> > surprised that Python treats them as two characters as well:
> >
> >    >>> len(u'晉')
> >    2
> >    >>> u'晉'
> >    u'\ud87e\udccd'
> >
> > Is it a bug, or only an implementation choice?
>
> If your Python is compiled as a narrow (UTF-16) build, then any
> character outside the Basic Multilingual Plane (above U+FFFF) will be
> seen as two characters by Python.  If you are using a UCS-4 build
> (effectively the same as UTF-32), then you should see the single wide
> character as a single character.  The only exception to this rule is
> if you enter the wide character as a surrogate pair, in which case
> Python doesn't normalize it into the single wide character.  To get a
> real wide character, you would need to use a proper escape (\U followed
> by eight hex digits), or decode from an encoded string.
>
>
>  - Josiah
>
> _______________________________________________
> Python-3000 mailing list
> [email protected]
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: 
> http://mail.python.org/mailman/options/python-3000/guido%40python.org
>
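
Josiah's distinction between a surrogate pair and a real wide character can be checked with a short sketch. It assumes a current Python 3 (3.3 or later), not the 2007 builds discussed in this thread, and the function name combine_surrogates is hypothetical.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = "\ud87e\udccd"                               # entered as two surrogates
escaped = "\U0002F8CD"                              # entered with a proper escape
decoded = b"\x7e\xd8\xcd\xdc".decode("utf-16-le")   # decoded from encoded bytes

print(len(pair), len(escaped), len(decoded))
print(hex(combine_surrogates(0xD87E, 0xDCCD)))
print(pair == escaped, decoded == escaped)
```

The pair typed as two escapes stays two items and is not merged, while both the \U escape and the decode produce a single code point.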


-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
