James Y Knight wrote:
> That seems backwards of how it should be ideally: the byte-string upper
> and lower should always do ascii uppering-and-lowering, and the unicode
> ones should do it according to locale. Perhaps that can be cleaned up in
> py3k?

Cleaned up, yes. But it is currently not backwards.

For a byte string, you need an encoding, which comes from the locale. So for byte strings, case-conversion *has* to be locale-aware (in principle, making it encoding-aware only would almost suffice, but there is no universal API for that).

OTOH, for Unicode, due to the unification, case-conversion mostly does not need to be locale-aware. Nearly all case-conversions are only script-dependent, not language-dependent. So it is nearly possible to make case-conversion locale-independent, and that is what Python provides.

The "nearly" above refers to *very* few exceptions, in *very* few languages. Most of the details are collected in UAX#21; some highlights are:

- case conversions are not always reversible
- sometimes, case conversion may convert a single character to multiple characters; the canonical example is German ß (considered lower-case) -> "SS" (historically, this is just typographical, since there is no upper case sharp s in our script)
- sometimes, conversion depends on the position of the letter in the word (see Final_Sigma in SpecialCasing.txt), or on the subsequent combining accents (see Lithuanian More_Above)

I believe the unicode.lower behaviour is currently right for most applications, so it should continue to be the default. An additional locale-aware version should be added, but that probably means incorporating ICU into Python, to get this and other locale properties right in a platform-independent fashion.

Regards,
Martin
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
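
The first three highlights in the message above (non-reversible conversion, one-to-many conversion, and position-dependent conversion) can be illustrated with a short sketch. It assumes a recent Python 3, where `str` is a Unicode string and `str.upper`/`str.lower` apply the locale-independent mappings; that is an assumption of this example, not something stated in the post, and older interpreters may behave differently. The function name `case_examples` is made up for illustration.

```python
# Sketch only: assumes a recent Python 3 (str is Unicode); not from the original post.

def case_examples():
    """Return results for the non-reversible, one-to-many and Final_Sigma cases."""
    word = "stra\u00dfe"                 # "strasse" spelled with sharp s (U+00DF)
    upper = word.upper()                 # the single sharp s becomes the two letters "SS"
    round_trip = upper.lower()           # lowering again does not restore the sharp s

    greek = "\u039f\u0394\u039f\u03a3"   # four capital Greek letters ending in sigma
    lowered = greek.lower()              # word-final sigma maps to U+03C2, not U+03C3
    return upper, round_trip, lowered


upper, round_trip, lowered = case_examples()
print(upper)                             # STRASSE
print(round_trip)                        # strasse
print(lowered.endswith("\u03c2"))        # True when Final_Sigma is applied
```

None of these calls consults the locale. The language-dependent cases, such as Lithuanian More_Above, are the ones that would need a separate locale-aware API.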