On Wed, 4 May 2016 12:49 am, Jussi Piitulainen wrote:

> DFS writes:
> 
>> On 5/3/2016 9:13 AM, Chris Angelico wrote:
> 
>>> It doesn't invert, the way numeric negation does.
>>
>> What do you mean by 'case inverted'?
>>
>> It looks like it swaps the case correctly between upper and lower.
> 
> There's letters that do not come in exact pairs of upper and lower case,

Scripts with two distinct lettercases, like the Latin alphabet used for
English, are called bicameral. The two cases are technically called
majuscule and minuscule, but are colloquially known as uppercase and
lowercase, since printers using movable type traditionally kept the
majuscule letters in a drawer (a "case") above the minuscule letters.

Many alphabets are unicameral, that is, they only have a single lettercase.
Examples include Hebrew, Arabic, Hangul, and many others. Georgian is an
interesting example, as it is the only known written alphabet that started
as a bicameral script and then became unicameral.

Consequently, many letters are neither upper nor lower case, and have the
Unicode general category "Lo" (Letter, other):

py> import unicodedata
py> c = u'\N{ARABIC LETTER FEH}'
py> unicodedata.category(c)
'Lo'
py> c.isalpha()
True
py> c.isupper()
False
py> c.islower()
False
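
To make the category point a bit more concrete, here is a rough sketch that
labels a few letters by their general category. The helper name
letter_case_kind and the choice of sample characters are mine, purely for
illustration; only unicodedata.category comes from the standard library.

```python
import unicodedata

# Hypothetical helper: map a character's general category to a description.
def letter_case_kind(ch):
    kinds = {
        'Lu': 'uppercase letter',
        'Ll': 'lowercase letter',
        'Lt': 'titlecase letter',
        'Lo': 'letter without case',
    }
    return kinds.get(unicodedata.category(ch), 'something else')

samples = [
    'A',
    'a',
    '\u01c5',                       # a titlecase digraph
    '\N{ARABIC LETTER FEH}',
    '\N{HEBREW LETTER ALEF}',
    '\N{HANGUL SYLLABLE GA}',
    '3',
]

for ch in samples:
    print(repr(ch), unicodedata.category(ch), letter_case_kind(ch))
```

The letters from the unicameral scripts all land in the "letter without
case" bucket, which is why isupper() and islower() are both False for them.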


Even among bicameral alphabets, there are a few anomalies. The three most
obvious ones are Greek sigma, German Eszett (or "sharp S") and Turkish I.

(1) The Greek sigma is usually written as Σ or σ in uppercase and lowercase
respectively, but at the end of a word, lowercase sigma is written as ς.

(This final sigma is sometimes called "stigma", but should not be confused
with the archaic Greek letter stigma, which has two cases Ϛ ϛ, at least
when it is not being written as digamma Ϝϝ -- and if you're confused, so
are the Greeks :-)

Python 3.3 correctly handles the sigma/final sigma when upper- and
lowercasing:

py> 'ΘΠΣΤΣ'.lower()
'θπστς'

py> 'ΘΠΣΤΣ'.lower().upper()
'ΘΠΣΤΣ'
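
That round trip works because it starts from uppercase. Going the other way
is where the asymmetry shows up. The following is only a small sketch,
assuming Python 3.3 or later (for casefold() and the final-sigma handling);
the variable names are my own.

```python
# Lowercasing picks the final form only at the end of a word.
word = 'ΘΠΣΤΣ'
lowered = word.lower()

# Both lowercase sigmas share a single uppercase form, so a final sigma
# on its own does not survive being uppercased and lowercased again.
final_sigma = '\N{GREEK SMALL LETTER FINAL SIGMA}'
medial_sigma = '\N{GREEK SMALL LETTER SIGMA}'
round_trip = final_sigma.upper().lower()

# swapcase() applied twice has the same problem.
swapped_twice = final_sigma.swapcase().swapcase()

# casefold() is intended for caseless comparison and maps both forms
# to the same letter.
same_when_folded = final_sigma.casefold() == medial_sigma.casefold()

print(lowered, round_trip == medial_sigma, swapped_twice == medial_sigma,
      same_when_folded)
```

So case swapping is not its own inverse here, which is the sense in which it
"doesn't invert".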



(2) The German Eszett ß traditionally existed only in lowercase. Although
an uppercase form has existed since at least the 19th century, it was left
out when the Germans moved away from blackletter to Roman-style letters. In
recent years, printers in Germany have started to reintroduce an uppercase
version, and the German government has standardized its use for placenames,
but not for other words.

(Aside: in Germany, ß is not considered a distinct letter of the alphabet,
but a ligature of ss; historically it derived from a ligature of ſs, ſz or
ſʒ. The funny characters you may or may not be able to see are the long-S
and round-Z.)

Python follows common, but not universal, German practice for eszett:

py> 'ẞ'.lower()
'ß'
py> 'ß'.upper()
'SS'

Note that this is lossy: given a name like "STRASSER", it is impossible to
tell whether it should be title-cased to "Strasser" or "Straßer". It also
means that uppercasing a string can make it longer.
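
A small sketch of both effects, using "Straße" as an arbitrary example word
of my own choosing and assuming Python 3.3 or later for casefold():

```python
street = 'Straße'

# Uppercasing expands the eszett to two letters, so the string grows.
shouted = street.upper()
grew_by = len(shouted) - len(street)

# Title-casing the uppercase form cannot know an eszett was ever there.
restored = shouted.title()

# For comparisons that ignore case, casefold() treats 'ß' and 'ss' alike.
matches = street.casefold() == 'STRASSE'.casefold()

print(shouted, grew_by, restored, matches)
```

If all you need is a caseless comparison, casefold() sidesteps the problem;
it does not help with recovering the original spelling.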


For more on the uppercase eszett, see:

https://typography.guru/journal/germanys-new-character/
https://typography.guru/journal/how-to-draw-a-capital-sharp-s-r18/


(3) In most Latin alphabets, the lowercase i and j carry a dot (a "tittle"),
but the uppercase forms I and J do not. Turkish and a few other languages
have both a dotted I and a dotless I as separate letters, each with its own
uppercase and lowercase form.

(As far as I know, there is no language with a dotless J.)

So in Turkish, the correct round trip from uppercase to lowercase and back
again should go:

Dotless I: I -> ı -> I

Dotted I: İ -> i -> İ

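Python does not handle this correctly for Turkish applications, since its
case conversions are not locale-aware and lose the dotted/dotless
distinction:

py> 'ı'.upper()
'I'
py> 'İ'.lower()
'i̇'

(In Python 3.3 and later, that last result is actually two code points: a
plain 'i' followed by U+0307 COMBINING DOT ABOVE, which may or may not be
visible in your font.) Further case conversions follow the non-Turkish
rules.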
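
If you need Turkish behaviour, one workaround is to remap the four I
letters yourself before calling the built-in methods. This is only a
minimal sketch: turkish_lower and turkish_upper are hypothetical helpers of
my own, not part of the standard library, and they make no attempt to
detect whether the text really is Turkish.

```python
DOTTED_CAPITAL_I = '\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}'
DOTLESS_SMALL_I = '\N{LATIN SMALL LETTER DOTLESS I}'

def turkish_lower(text):
    """Lowercase text using the Turkish pairing of the I letters."""
    text = text.replace(DOTTED_CAPITAL_I, 'i')   # dotted capital -> dotted small
    text = text.replace('I', DOTLESS_SMALL_I)    # dotless capital -> dotless small
    return text.lower()

def turkish_upper(text):
    """Uppercase text using the Turkish pairing of the I letters."""
    text = text.replace('i', DOTTED_CAPITAL_I)   # dotted small -> dotted capital
    text = text.replace(DOTLESS_SMALL_I, 'I')    # dotless small -> dotless capital
    return text.upper()

# For comparison: the built-in lower() yields 'i' plus a combining dot.
default_lower = DOTTED_CAPITAL_I.lower()

phrase = 'DI\u015e \u0130\u015eLER\u0130'
print(turkish_lower(phrase))
print(turkish_upper(turkish_lower(phrase)) == phrase)
print(len(default_lower))
```

The order of the replace() calls matters: each helper first moves the
letter that would otherwise collide with the result of the second
replacement.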

Note that sometimes getting this wrong can have serious consequences:

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail



-- 
Steven
