Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

Martin v. Löwis Tue, 25 Oct 2005 14:21:56 -0700

Guido van Rossum wrote:
> Yes but why? What does this invariant do for him?


I don't know about this person, but there are a few things that
don't work properly in UTF-16 mode:

- the Unicode character database fails to look things up.
   u"\U0001D670".isupper() gives False, but should give True
   (since it denotes MATHEMATICAL MONOSPACE CAPITAL A).
   It gives True in UCS-4 mode.
- As a result, normalization on these doesn't work, either.
   It should normalize to "LATIN CAPITAL LETTER A" under
   NFKC, but doesn't.
- regular expressions only have limited support. In
   particular, adding non-BMP characters to character classes
   is not possible. [\U0001D670] will match any character
   that is either \uD835 or \uDE70, whereas it only matches
   MATHEMATICAL MONOSPACE CAPITAL A in UCS-4 mode.
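
For reference, here is a small sketch of the behaviour expected in
UCS-4 mode for all three points. It assumes a modern Python 3
(3.3 or later), where the narrow/wide build distinction no longer
exists, so it cannot reproduce the UTF-16 failures themselves. The
helper utf16_code_units is a hypothetical name introduced only to
show the surrogate pair a narrow build would store.

```python
import re
import unicodedata

CH = "\U0001D670"  # MATHEMATICAL MONOSPACE CAPITAL A


def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16.

    Hypothetical helper: it only illustrates what a narrow (UTF-16)
    build would hold in memory for this string.
    """
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


# 1. Character database lookup on the full code point.
print(unicodedata.name(CH), CH.isupper())

# 2. NFKC normalization maps the character to LATIN CAPITAL LETTER A.
print(unicodedata.normalize("NFKC", CH) == "A")

# 3. A character class containing the non-BMP character matches that
#    character, and does not match a lone surrogate half.
pattern = re.compile("[\U0001D670]")
print(pattern.fullmatch(CH) is not None)
print(pattern.search("\ud835") is None)

# The surrogate pair mentioned above: \uD835 followed by \uDE70.
print([hex(unit) for unit in utf16_code_units(CH)])
```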

There might be more limitations, but those are the ones that
come to mind easily. While I could imagine fixing the first
two with some effort, the third one is really tricky (unless
you would accept a "wide" representation of a character
class even if the Unicode representation is only narrow).
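
To make the parenthetical concrete, here is a rough sketch of what a
"wide" character class over narrow data could look like. It is only
an illustration under my own assumptions, not how the regular
expression engine is implemented: the class is held as a set of full
code points, the subject is a list of UTF-16 code units, and
iter_code_points and find_in_class are hypothetical names.

```python
def iter_code_points(units):
    """Yield (index, code_point) pairs from a list of UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into one
    code point; anything else (including a lone surrogate) is yielded
    unchanged as a single unit.
    """
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            combined = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            yield i, combined
            i += 2
        else:
            yield i, unit
            i += 1


def find_in_class(wide_class, units):
    """Return (index, code_point) for each member of wide_class in units."""
    return [(i, cp) for i, cp in iter_code_points(units) if cp in wide_class]


# The class [\U0001D670], stored "wide" as full code points.
wide_class = {0x1D670}

# "A" followed by the surrogate pair for U+1D670: only the pair matches.
print(find_in_class(wide_class, [0x41, 0xD835, 0xDE70]))

# Lone surrogate halves do not match the class.
print(find_in_class(wide_class, [0xD835]))
print(find_in_class(wide_class, [0xDE70]))
```

With this representation the class never matches \uD835 or \uDE70 on
its own, at the cost of decoding surrogate pairs while matching.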

Regards,
Martin
