James Y Knight wrote:
> That seems backwards of how it should be ideally: the byte-string upper
> and lower should always do ascii uppering-and-lowering, and the unicode
> ones should do it according to locale. Perhaps that can be cleaned up in
> py3k?

Cleaned up, yes. But it is currently not backwards.

For a byte string, you need an encoding, which comes from the locale.
So for byte strings, case-conversion *has* to be locale-aware (in
principle, making it encoding-aware only would almost suffice, but
there is no universal API for that).
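
To illustrate why the encoding matters, here is a minimal sketch. It
assumes Python 3, where bytes and text are separate types; the helper
name upper_bytes is hypothetical and not part of any library. The same
byte, 0xE4, is a lower-case letter in Latin-1 but an upper-case letter
in KOI8-R, so no encoding-blind rule can upper-case it correctly.

```python
def upper_bytes(data: bytes, encoding: str) -> bytes:
    """Upper-case a byte string by interpreting it in a given encoding.

    Hypothetical helper for illustration only. It can raise
    UnicodeEncodeError if the upper-case form does not exist in
    the target encoding.
    """
    return data.decode(encoding).upper().encode(encoding)


raw = b"\xe4"

# In Latin-1, 0xE4 is lower-case a-umlaut; its upper case is 0xC4.
as_latin1 = upper_bytes(raw, "latin-1")

# In KOI8-R, 0xE4 is already an upper-case Cyrillic letter (De),
# so the byte is unchanged.
as_koi8r = upper_bytes(raw, "koi8_r")

print(as_latin1, as_koi8r)
```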

OTOH, for Unicode, due to the unification, case-conversion mostly
does not need to be locale-aware. Nearly all case-conversions are
only script-dependent, not language-dependent. So it is nearly possible
to make case-conversion locale-independent, and that is what Python
provides.
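
As a rough sketch of what "script-dependent, not language-dependent"
means in practice (assuming Python 3, where str is the Unicode type),
the conversions below give the same result whatever locale is set.
The Turkish case at the end is one of the few places where a language
would want something different.

```python
# Each mapping depends only on the script, not on any locale setting.
latin_upper = "école".upper()        # French, Latin script
cyrillic_upper = "привет".upper()    # Russian, Cyrillic script
greek_upper = "αθήνα".upper()        # Greek script

# A language-dependent exception: Turkish expects "i" to upper-case to
# the dotted capital I (U+0130), but the locale-independent mapping
# always gives plain "I".
plain_i_upper = "i".upper()

print(latin_upper, cyrillic_upper, greek_upper, plain_i_upper)
```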

The "nearly" above refers to *very* few exceptions, in *very*
few languages. Most of the details are collected in UAX#21; some
highlights are:
- case conversions are not always reversible
- sometimes, case conversion may convert a single
  character to multiple characters; the canonical
  example is German ß (considered lower-case) -> "SS"
  (historically, this is just typographical, since there
   is no upper case sharp s in our script)
- sometimes, conversion depends on the position of
  the letter in the word, see Final_Sigma
  in SpecialCasing.txt, or on the subsequent
  combining accents, see Lithuanian More_Above
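
The first three points can be observed directly. This is a sketch that
assumes Python 3.3 or later, where str.upper and str.lower apply the
full Unicode case mappings; older versions used only the simple
one-to-one mappings and would behave differently.

```python
word = "straße"

# One character becomes two: sharp s upper-cases to "SS".
upper = word.upper()

# Not reversible: lower-casing the result does not restore sharp s.
round_trip = upper.lower()

# Position-dependent: capital sigma lower-cases to the final form
# (U+03C2) at the end of a word, and to the ordinary form (U+03C3)
# when it stands alone.
final_sigma = "ΟΔΟΣ".lower()
lone_sigma = "Σ".lower()

print(upper, round_trip, final_sigma, lone_sigma)
```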

I believe the unicode.lower behaviour is currently right
for most applications, so it should continue to be the
default. An additional locale-aware version should be added,
but that probably means incorporating ICU into Python,
to get this and other locale properties right in a
platform-independent fashion.

Regards,
Martin
