[issue34723] lower() on Turkish letter "İ" returns a 2-chars-long string

STINNER Victor Tue, 18 Sep 2018 07:15:59 -0700

STINNER Victor <[email protected]> added the comment:

> Should it not simply return “i”?


Python implements the Unicode standard.

>>> "U+%04x" % ord("İ")
'U+0130'
>>> ["U+%04x" % ord(ch) for ch in "İ".lower()]
['U+0069', 'U+0307']

>>> import unicodedata
>>> unicodedata.name("İ")
'LATIN CAPITAL LETTER I WITH DOT ABOVE'
>>> [unicodedata.name(ch) for ch in "İ".lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

At the C level, lower_ucs4() calls _PyUnicode_ToLowerFull(), which looks the 
character up in Python's internal Unicode database.

The U+0130 character takes the EXTENDED_CASE_MASK path, which uses the 
secondary _PyUnicode_ExtendedCase database for "extended case" mappings.

In the end, Python uses the following data file from the Unicode standard:

https://www.unicode.org/Public/9.0.0/ucd/SpecialCasing.txt

Extract:
"""
# Preserve canonical equivalence for I with dot. Turkic is handled below.

0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


If you want to convert strings differently for the special case of Turkish, you 
need language-specific (Turkic) case mappings, which str.lower() does not apply: 
it only uses the language-independent Unicode mappings.
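
For illustration only, a rough pure-Python sketch of Turkish-style lowercasing. It assumes the only difference from the default mapping is the dotted/dotless I pair; turkish_lower is a hypothetical helper, not something Python or its standard library provides.

```python
def turkish_lower(text):
    """Lowercase text using the Turkish dotted/dotless I convention.

    Hypothetical helper for illustration; not part of the standard library.
    """
    # Map the two Turkish-specific capitals first:
    #   U+0130 (I with dot above) -> "i"
    #   "I"                       -> U+0131 (dotless i)
    # then let str.lower() handle every other character as usual.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print(turkish_lower("İSTANBUL"))
print(turkish_lower("IĞDIR"))
print(len(turkish_lower("İ")))
```

This sketch does not handle the decomposed spelling "I" followed by U+0307; normalizing the input to NFC first would be one way to cover it. It also must not be applied to non-Turkish text, where "I" should lowercase to "i".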

I'm closing the issue as NOT A BUG.

----------
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34723>
_______________________________________